DeepShot: Neural Networks for NBA Shot Prediction and Optimization¶
Welcome to the DeepShot project! In this series of notebooks, we'll explore how deep learning can help us understand, predict, and optimize basketball shots in the NBA.
What We'll Cover¶
- Why shot prediction matters in basketball
- The limitations of traditional basketball analytics
- How deep learning can provide new insights
- Our research questions and goals
- The data and methods we'll use
- What we expect to discover
Project Overview¶
Basketball is a game of decisions. Every time a player has the ball, they need to decide: Should I shoot? Pass? Drive? And if they shoot, what are their chances of making it?
Traditional basketball analytics often uses simple statistics and predetermined court zones to answer these questions. For example, analysts might divide the court into areas like "corner 3," "top of the key," or "restricted area," and calculate shooting percentages for each zone.
But this approach has limitations:
- Arbitrary Zones: The court divisions are created by humans and might not reflect natural shooting patterns
- Missing Context: Simple percentages don't account for defenders, game situation, or player fatigue
- Limited Personalization: Players have unique shooting styles that get averaged out
- Static Analysis: The dynamic flow of the game gets lost
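For reference, the zone-based baseline described above amounts to a grouped average. The sketch below uses a made-up shot log; the column names `zone` and `made` are illustrative assumptions, not the dataset's actual schema.

```python
import pandas as pd

# Toy shot log; `zone` and `made` are hypothetical column names for illustration
shots = pd.DataFrame({
    "zone": ["Restricted Area"] * 4 + ["Above the Break 3"] * 3 + ["Left Corner 3"] * 2,
    "made": [1, 1, 0, 1, 0, 1, 0, 1, 0],
})

# Zone-based analytics: one shooting percentage per predefined zone
zone_fg_pct = shots.groupby("zone")["made"].mean()
print(zone_fg_pct.round(3))
```

Every shot inside a zone contributes to the same number, which is why shooter identity, defender position, and game state disappear from this kind of summary.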
This is where deep learning comes in. Instead of relying on human-defined zones and simple statistics, we can use neural networks to:
- Learn optimal court representations directly from millions of shots
- Capture each player's unique shooting tendencies
- Include game context like score, time remaining, and momentum
- Provide personalized insights for players and teams
Our Research Questions¶
In this project, we're trying to answer several key questions:
How does court location affect shot success? Beyond simple distance, what spatial patterns exist in shooting effectiveness?
What makes each player unique? Can we capture individual shooting tendencies in a way that allows meaningful comparison?
How does game context matter? How do factors like score differential, time remaining, and recent performance affect shooting?
Can we combine these factors effectively? How can we integrate spatial, player, and contextual information to make better predictions?
What strategic insights can we gain? How has NBA shooting strategy evolved, and what patterns lead to success?
Our Data¶
We're using three comprehensive datasets:
NBA Shots Dataset: Over 4.2 million shots from 2004-2024, including details on shot location, type, and outcome. This is our primary dataset.
NBA Injury Stats Dataset: 23,450 injuries from 1951-2023, which helps us understand player availability and performance context.
NBA Team Statistics Dataset: Comprehensive team performance metrics that provide defensive context and enable team-level analysis.
Together, these datasets give us a complete picture of NBA shooting over two decades.
Our Approach¶
We're using several deep learning techniques:
- Convolutional Neural Networks (CNNs) to process spatial shot data and identify location-specific patterns.
Think of this like a basketball coach with perfect memory who has watched millions of shots from every spot on the court and learned which locations are most effective.
- Player Embeddings to create vector representations of player shooting tendencies.
Imagine creating a "basketball DNA" for each player that captures their unique shooting style in a way that lets us compare players mathematically.
- Neural Networks to incorporate game context like quarter, time remaining, and score margin.
This is like understanding how the game situation affects shooting - how players perform differently in different quarters or when they're up by 20 versus down by 2.
- Multi-branch Architecture to combine different types of features and capture their interactions.
This helps us understand how spatial, player, and context features work together to influence shot success - like knowing that certain players excel in specific court locations during particular game situations.
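To make the multi-branch idea concrete, here is a minimal forward-pass sketch in plain NumPy. It is an illustration under stated assumptions rather than the model built later in the series: the class name, layer sizes, and random (untrained) weights are all hypothetical, and a real implementation would use a deep learning framework with learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d_valid(grid, kernels):
    """Slide each 3x3 kernel over a 2D grid (no padding) and return feature maps."""
    # windows has shape (H-2, W-2, 3, 3); this is cross-correlation, as in most DL libraries
    windows = np.lib.stride_tricks.sliding_window_view(grid, (3, 3))
    return np.einsum("hwij,kij->khw", windows, kernels)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MultiBranchSketch:
    """Hypothetical three-branch model: spatial grid, player embedding, game context."""

    def __init__(self, n_players, embed_dim=8, n_kernels=4, n_context=3, hidden=16):
        self.kernels = rng.normal(0, 0.1, (n_kernels, 3, 3))          # spatial branch
        self.player_table = rng.normal(0, 0.1, (n_players, embed_dim))  # embedding branch
        self.w_context = rng.normal(0, 0.1, (n_context, hidden))       # context branch
        self.w_out = rng.normal(0, 0.1, (n_kernels + embed_dim + hidden,))

    def predict(self, court_grid, player_idx, context):
        # Spatial branch: convolution, ReLU, then global average pooling per kernel
        spatial = relu(conv2d_valid(court_grid, self.kernels)).mean(axis=(1, 2))
        # Player branch: look up the player's embedding vector
        player = self.player_table[player_idx]
        # Context branch: one dense layer over quarter, time remaining, score margin
        ctx = relu(context @ self.w_context)
        # Fusion: concatenate the branch outputs and map to a make probability
        fused = np.concatenate([spatial, player, ctx])
        return float(sigmoid(fused @ self.w_out))

model = MultiBranchSketch(n_players=5)
grid = rng.random((10, 10))            # stand-in for a binned court map
context = np.array([4.0, 0.5, -2.0])   # stand-in for quarter, time left, score margin
p = model.predict(grid, player_idx=2, context=context)
print(round(p, 3))
```

The point of the sketch is the fusion step: each branch produces its own feature vector, and the final layer sees all three at once, which is what lets the model pick up interactions between location, shooter, and situation.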
By the end of this project, we expect to:
- Build accurate shot prediction models using available data and straightforward deep learning approaches
- Create personalized shot maps showing optimal shooting locations for different players
- Identify key factors that influence shot success in different situations
- Discover strategic insights about basketball shooting patterns
- Develop practical analytical approaches that could help understand shooting performance
Project Structure¶
This project is organized into the following sections:
- Abstract and Introduction
- Data Collection - Gathering our datasets
- Data Cleaning and Validation - Ensuring data quality
- Data Standardization - Creating consistent formats
- Feature Engineering - Creating useful derived features
- Data Exploration - Spatial Patterns - Understanding court location effects
- Data Exploration - Temporal and Contextual - Analyzing game situation effects
- Spatial Model - Building our CNN for court locations
- Game Context Model - Modeling basic game situation features
- Integrated Model - Combining spatial, player, and context features
- Integrated Model - Training and Evaluation - Building and testing the model
- Shot Optimization - Finding optimal shooting strategies
- Strategic Insights - Shot Evolution - Analyzing historical trends
- Strategic Insights - Team Analysis - Examining team strategies
- Conclusions and Future Work - Summarizing our findings
Each section builds on the previous ones, creating a comprehensive pipeline from data collection to actionable insights.
Why This Matters¶
This project isn't just an academic exercise. The insights we gain could help:
- Players identify their optimal shooting zones and development opportunities
- Coaches design more effective offensive and defensive strategies
- Teams make better roster construction decisions
- Analysts develop new approaches to understanding the game
Basketball is evolving rapidly, with the three-point revolution and analytics-driven strategies transforming how the game is played. By applying deep learning to this domain, we can discover patterns that traditional analysis might miss and contribute to the next evolution of the sport.
Next Steps¶
Next, we'll start our journey by collecting the data we need for our analysis. We'll set up the Kaggle API, download our datasets, and prepare our directory structure for the project.
DeepShot: Data Collection¶
Introduction¶
Data collection is the foundation of our NBA shot prediction project. In this notebook, we gather comprehensive data on NBA shots, player information, and team statistics to support our analysis and modeling.
The quality and scope of our data directly impact the insights we can derive and the accuracy of our predictive models. For this project, we need data that captures:
- Shot Information: Location, outcome, shooter, game context
- Player Information: Career statistics, position, experience
- Team Information: Performance metrics, playing style, defensive ratings
We've chosen to use Kaggle as our primary data source because it offers well-maintained, comprehensive NBA datasets with the necessary breadth and depth. Using the Kaggle API allows us to programmatically download these datasets, making our process reproducible and updatable.
Data Source Selection¶
When selecting data sources for this project, we considered several factors:
- Comprehensiveness: We need data covering multiple seasons to identify long-term patterns
- Granularity: Shot-level data is required for spatial analysis
- Reliability: Data should come from reputable sources with minimal errors
- Accessibility: Data should be programmatically accessible for reproducibility
The Kaggle datasets we've selected meet these criteria and provide complementary information:
- NBA Shots Dataset: Provides detailed shot-level information
- NBA Injury Stats Dataset: Provides context about player availability
- NBA Team Statistics Dataset: Provides team-level performance metrics
Together, these datasets give us a complete picture of NBA shooting over two decades.
Data Organization¶
Proper organization of our data is essential for an efficient workflow. We've structured our data directory as follows:
- Raw Data: Original, unmodified datasets as downloaded from Kaggle
- Interim Data: Partially processed data that has undergone cleaning but not full processing
- Processed Data: Fully processed, analysis-ready data
This organization follows best practices for data science projects, creating a clear separation between original and processed data while maintaining a record of intermediate steps.
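As a minimal sketch of this layout, the helper below creates the three stage folders under a given root. The function name `make_data_dirs` is hypothetical, and the demo runs in a temporary directory so it does not touch the real project tree.

```python
from pathlib import Path
import tempfile

def make_data_dirs(root):
    """Create raw/interim/processed folders under `root` and return them by name."""
    stages = ("raw", "interim", "processed")
    dirs = {stage: Path(root) / stage for stage in stages}
    for path in dirs.values():
        path.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
    return dirs

# Demonstrate in a throwaway directory so nothing in the project is modified
with tempfile.TemporaryDirectory() as tmp:
    dirs = make_data_dirs(Path(tmp) / "data")
    created = sorted(p.name for p in (Path(tmp) / "data").iterdir())
    print(created)  # → ['interim', 'processed', 'raw']
```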
# ##HIDE##
import os
from pathlib import Path

import kaggle  # expects Kaggle API credentials to be configured
import pandas as pd

# Project data directories (paths are relative to the notebook location)
data_dir = Path('../data')
raw_dir = data_dir / 'raw'
interim_dir = data_dir / 'interim'
processed_dir = data_dir / 'processed'
for directory in [data_dir, raw_dir, interim_dir, processed_dir]:
    directory.mkdir(parents=True, exist_ok=True)

def download_dataset(dataset, path):
    """Download and unzip a Kaggle dataset, then return the CSV files found in `path`."""
    os.makedirs(path, exist_ok=True)
    kaggle.api.dataset_download_files(dataset, path=path, unzip=True)
    return list(Path(path).glob('*.csv'))

# Shot-level data (one CSV per season)
shots_path = raw_dir / 'shots'
shot_files = download_dataset('mexwell/nba-shots', shots_path)
print(f"Downloaded {len(shot_files)} shot data files")

# Injury records
injuries_path = raw_dir / 'injuries'
injury_files = download_dataset('loganlauton/nba-injury-stats-1951-2023', injuries_path)
print(f"Downloaded {len(injury_files)} injury data files")

# Team and player statistics
team_stats_path = raw_dir / 'team_stats'
team_stats_files = download_dataset('sumitrodatta/nba-aba-baa-stats', team_stats_path)
print(f"Downloaded {len(team_stats_files)} team stats files")
Dataset URL: https://www.kaggle.com/datasets/mexwell/nba-shots
Downloaded 21 shot data files
Dataset URL: https://www.kaggle.com/datasets/loganlauton/nba-injury-stats-1951-2023
Downloaded 1 injury data files
Dataset URL: https://www.kaggle.com/datasets/sumitrodatta/nba-aba-baa-stats
Downloaded 22 team stats files
# Preview the first file of each dataset
if shot_files:
    shots_sample = pd.read_csv(shot_files[0])
    print(f"Shot data: {shots_sample.shape[0]} rows, {shots_sample.shape[1]} columns")
    display(shots_sample.head(3))
if injury_files:
    injuries_sample = pd.read_csv(injury_files[0])
    print(f"Injury data: {injuries_sample.shape[0]} rows, {injuries_sample.shape[1]} columns")
    display(injuries_sample.head(3))
if team_stats_files:
    team_stats_sample = pd.read_csv(team_stats_files[0])
    print(f"Team stats: {team_stats_sample.shape[0]} rows, {team_stats_sample.shape[1]} columns")
    display(team_stats_sample.head(3))
Shot data: 199030 rows, 26 columns
| | SEASON_1 | SEASON_2 | TEAM_ID | TEAM_NAME | PLAYER_ID | PLAYER_NAME | POSITION_GROUP | POSITION | GAME_DATE | GAME_ID | ... | BASIC_ZONE | ZONE_NAME | ZONE_ABB | ZONE_RANGE | LOC_X | LOC_Y | SHOT_DISTANCE | QUARTER | MINS_LEFT | SECS_LEFT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2009 | 2008-09 | 1610612744 | Golden State Warriors | 201627 | Anthony Morrow | G | SG | 04-15-2009 | 20801229 | ... | Restricted Area | Center | C | Less Than 8 ft. | -0.0 | 5.25 | 0 | 4 | 0 | 1 |
| 1 | 2009 | 2008-09 | 1610612744 | Golden State Warriors | 101235 | Kelenna Azubuike | F | SF | 04-15-2009 | 20801229 | ... | Restricted Area | Center | C | Less Than 8 ft. | -0.0 | 5.25 | 0 | 4 | 0 | 9 |
| 2 | 2009 | 2008-09 | 1610612756 | Phoenix Suns | 255 | Grant Hill | F | SF | 04-15-2009 | 20801229 | ... | Restricted Area | Center | C | Less Than 8 ft. | -0.0 | 5.25 | 0 | 4 | 0 | 25 |
3 rows × 26 columns
Injury data: 37667 rows, 6 columns
| | Unnamed: 0 | Date | Team | Acquired | Relinquished | Notes |
|---|---|---|---|---|---|---|
| 0 | 0 | 1951-12-25 | Bullets | NaN | Don Barksdale | placed on IL |
| 1 | 1 | 1952-12-26 | Knicks | NaN | Max Zaslofsky | placed on IL with torn side muscle |
| 2 | 2 | 1956-12-29 | Knicks | NaN | Jim Baechtold | placed on inactive list |
Team stats: 1432 rows, 28 columns
| | season | lg | team | abbreviation | playoffs | g | mp | fg_per_100_poss | fga_per_100_poss | fg_percent | ... | ft_percent | orb_per_100_poss | drb_per_100_poss | trb_per_100_poss | ast_per_100_poss | stl_per_100_poss | blk_per_100_poss | tov_per_100_poss | pf_per_100_poss | pts_per_100_poss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025 | NBA | Atlanta Hawks | ATL | False | 60 | 14500 | 40.9 | 88.3 | 0.463 | ... | 0.769 | 11.4 | 31.8 | 43.2 | 28.1 | 9.6 | 4.9 | 15.4 | 18.1 | 112.3 |
| 1 | 2025 | NBA | Boston Celtics | BOS | False | 60 | 14525 | 42.6 | 92.5 | 0.461 | ... | 0.796 | 11.1 | 34.8 | 45.9 | 26.4 | 7.4 | 5.7 | 12.2 | 16.7 | 119.7 |
| 2 | 2025 | NBA | Brooklyn Nets | BRK | False | 59 | 14235 | 38.9 | 88.5 | 0.439 | ... | 0.795 | 11.4 | 31.3 | 42.6 | 25.6 | 8.1 | 4.5 | 16.1 | 21.3 | 108.7 |
3 rows × 28 columns
Next Steps¶
With our data successfully collected and organized, we're now ready to proceed to data cleaning and validation. In the next notebook, we'll:
- Inspect the data for quality issues
- Handle missing values and outliers
- Validate data consistency across datasets
- Prepare the data for standardization
The data collection phase has provided us with a rich foundation of over 4.2 million shots, 23,450 injury records, and comprehensive team statistics spanning two decades. This extensive dataset will enable us to build robust models and derive meaningful insights about NBA shooting patterns.
DeepShot: Data Cleaning and Validation¶
Introduction¶
This notebook focuses on validating data quality and performing essential cleaning operations on our NBA shot data. Data cleaning is a critical step in our analysis pipeline as it ensures the reliability and accuracy of our subsequent modeling efforts.
In this notebook, we will:
- Check for missing values across all datasets
- Identify and remove duplicate records
- Detect outliers using the interquartile range (IQR) method
- Convert data types for consistency
- Handle missing values appropriately based on their frequency
We've implemented a custom DataValidator class that systematically identifies data quality issues and applies appropriate cleaning operations. Our approach is to be conservative with data cleaning - we only fill missing values if they represent less than 5% of the data, and we flag outliers for further analysis rather than automatically removing them.
# ##HIDE##
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
data_dir = Path('../data')
raw_dir = data_dir / 'raw'
interim_dir = data_dir / 'interim'
interim_dir.mkdir(parents=True, exist_ok=True)
Data Validation Methodology¶
Our DataValidator class implements a systematic approach to data quality assessment and cleaning. The class has the following key methods:
- check_missing(): Identifies columns with missing values and calculates the percentage of missing data in each column
- check_duplicates(): Identifies duplicate records in the dataset
- check_outliers(): Uses the IQR method to identify potential outliers in numeric columns
- clean(): Applies appropriate cleaning operations based on the identified issues
For missing values, we take a conservative approach:
- If a column has less than 5% missing values, we fill numeric values with the median and categorical values with the mode
- If a column has 5% or more missing values, we preserve the missing values to avoid introducing bias
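The threshold rule can be illustrated on a toy frame. This is a sketch under assumptions, not the validator itself: the helper name `fill_sparse_missing` and the columns `distance` and `defender` are made up for the example.

```python
import numpy as np
import pandas as pd

def fill_sparse_missing(df, threshold_pct=5.0):
    """Fill a column only when its share of missing values is below threshold_pct."""
    out = df.copy()
    for col in out.columns:
        missing_pct = out[col].isnull().mean() * 100
        if 0 < missing_pct < threshold_pct:
            if pd.api.types.is_numeric_dtype(out[col]):
                out[col] = out[col].fillna(out[col].median())   # numeric: median
            else:
                out[col] = out[col].fillna(out[col].mode()[0])  # categorical: mode
    return out

# Toy frame with 100 rows: `distance` is 2% missing, `defender` is 10% missing
toy = pd.DataFrame({
    "distance": [np.nan, np.nan] + list(range(98)),
    "defender": [None] * 10 + ["Player A"] * 90,
})
filled = fill_sparse_missing(toy)
print(filled.isnull().sum())
```

Here `distance` is filled with its median while `defender` keeps its ten gaps, because imputing a tenth of a column would say more about the imputation than about the data.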
For outliers, we flag them for further analysis rather than automatically removing them, recognizing that in basketball data, outliers may represent legitimate but rare events.
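A short worked example of the IQR rule, using invented shot distances in feet, shows why flagging is safer than dropping. The helper `iqr_bounds` is a hypothetical name used only for this sketch.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Nine close-range attempts and one half-court heave (made-up values)
distances = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 47], name="shot_distance_ft")
lower, upper = iqr_bounds(distances)
is_outlier = (distances < lower) | (distances > upper)
flagged = distances[is_outlier]

print(lower, upper)      # → -3.5 14.5
print(flagged.tolist())  # → [47]
```

The 47-foot attempt falls outside the fences even though it is a perfectly valid shot, so removing it automatically would discard real information.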
class DataValidator:
    """Collects data quality issues for one dataset and applies conservative cleaning."""

    def __init__(self, dataset_name):
        self.dataset_name = dataset_name
        self.issues = {}
        self.actions = []

    def check_missing(self, df):
        # Count missing values per column and express them as a percentage of rows
        missing = df.isnull().sum()
        missing_pct = (missing / len(df)) * 100
        self.issues['missing'] = missing[missing > 0]
        return missing_pct

    def check_duplicates(self, df):
        # Count fully duplicated rows
        dups = df.duplicated().sum()
        self.issues['duplicates'] = dups
        return dups

    def check_outliers(self, df, numeric_cols=None):
        if numeric_cols is None:
            numeric_cols = df.select_dtypes(include=[np.number]).columns
        outliers = {}
        for col in numeric_cols:
            # Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
            Q1, Q3 = df[col].quantile(0.25), df[col].quantile(0.75)
            IQR = Q3 - Q1
            lower, upper = Q1 - 1.5*IQR, Q3 + 1.5*IQR
            count = ((df[col] < lower) | (df[col] > upper)).sum()
            if count > 0:
                outliers[col] = count
        self.issues['outliers'] = outliers
        return outliers

    def clean(self, df):
        df_clean = df.copy()
        # Remove duplicates
        df_clean = df_clean.drop_duplicates()
        if len(df_clean) < len(df):
            self.actions.append(f"Removed {len(df) - len(df_clean)} duplicates")
        # Text columns that must not be converted to numeric
        exclude_cols = [
            # Shot data columns
            'TEAM_NAME', 'PLAYER_NAME', 'HOME_TEAM', 'AWAY_TEAM', 'EVENT_TYPE',
            'ACTION_TYPE', 'SHOT_TYPE', 'BASIC_ZONE', 'ZONE_NAME', 'ZONE_ABB',
            'ZONE_RANGE', 'POSITION', 'POSITION_GROUP', 'GAME_DATE', 'SEASON_2',
            # Team stats columns
            'team', 'abbreviation', 'player', 'pos', 'lg', 'tm', 'experience',
            'birth_year', 'birth_date', 'college', 'slug', 'arena', 'season',
            'playoffs', 'winner', 'replaced', 'type', 'number_tm', 'position',
            'hof', 'from', 'to', 'ht_in_in', 'wt',
            # Player data columns
            'seas_id', 'player_id', 'player', 'pos', 'lg', 'tm', 'season'
        ]
        # Convert remaining object columns to numeric; unparseable values become NaN
        for col in df_clean.select_dtypes(include=['object']).columns:
            if col not in exclude_cols:  # Skip name/text columns
                try:
                    # Use loc to avoid SettingWithCopyWarning
                    df_clean.loc[:, col] = pd.to_numeric(df_clean[col], errors='coerce')
                    self.actions.append(f"Converted {col} to numeric")
                except (ValueError, TypeError):
                    pass
        # Fill missing values only in columns that are sparsely missing
        for col in df_clean.columns:
            missing_pct = df_clean[col].isnull().mean() * 100
            if missing_pct > 0 and missing_pct < 5:  # Only fix if < 5% missing
                if pd.api.types.is_numeric_dtype(df_clean[col]):
                    df_clean = df_clean.fillna({col: df_clean[col].median()})
                    self.actions.append(f"Filled missing values in {col} with median")
                else:
                    df_clean = df_clean.fillna({col: df_clean[col].mode()[0]})
                    self.actions.append(f"Filled missing values in {col} with mode")
        return df_clean
Shot Data Validation¶
shots_path = raw_dir / 'shots'
shot_files = list(shots_path.glob('*.csv'))
if not shot_files:
    print("No shot data files found")
else:
    print(f"Found {len(shot_files)} shot data files")
    all_shots = []
    file_summaries = []
    for file in shot_files:
        # File names look like NBA_2009_Shots.csv; the middle token is the season
        season = file.stem.split('_')[1] if '_' in file.stem else 'unknown'
        print(f"Processing {file.name} (Season {season})...")
        df = pd.read_csv(file)
        df['season'] = season
        # Validate and clean each season separately
        validator = DataValidator(f"Shots {season}")
        validator.check_missing(df)
        validator.check_duplicates(df)
        validator.check_outliers(df)
        df_clean = validator.clean(df)
        all_shots.append(df_clean)
        file_summaries.append({
            'season': season,
            'original_rows': len(df),
            'cleaned_rows': len(df_clean),
            'missing_cols': len(validator.issues.get('missing', {})),
            'duplicates': validator.issues.get('duplicates', 0),
            'outlier_cols': len(validator.issues.get('outliers', {}))
        })
        print(f" Original rows: {len(df)}")
        print(f" Cleaned rows: {len(df_clean)}")
        print(f" Actions: {len(validator.actions)}")
    if all_shots:
        # Combine all seasons into one file for the next pipeline stage
        shots_clean = pd.concat(all_shots, ignore_index=True)
        print(f"\nCombined cleaned shot data: {len(shots_clean)} rows")
        shots_clean.to_csv(interim_dir / 'shots_clean.csv', index=False)
        print(f"Saved cleaned shot data to {interim_dir / 'shots_clean.csv'}")
        summary_df = pd.DataFrame(file_summaries)
        display(summary_df)
Found 21 shot data files
Processing NBA_2009_Shots.csv (Season 2009)...
 Original rows: 199030
 Cleaned rows: 199011
 Actions: 1
Processing NBA_2004_Shots.csv (Season 2004)...
 Original rows: 189803
 Cleaned rows: 189788
 Actions: 1
Processing NBA_2010_Shots.csv (Season 2010)...
 Original rows: 200966
 Cleaned rows: 200955
 Actions: 1
Processing NBA_2016_Shots.csv (Season 2016)...
 Original rows: 207893
 Cleaned rows: 207893
 Actions: 0
Processing NBA_2023_Shots.csv (Season 2023)...
 Original rows: 217220
 Cleaned rows: 217207
 Actions: 3
Processing NBA_2008_Shots.csv (Season 2008)...
 Original rows: 200501
 Cleaned rows: 200490
 Actions: 1
Processing NBA_2024_Shots.csv (Season 2024)...
 Original rows: 218701
 Cleaned rows: 218687
 Actions: 3
Processing NBA_2011_Shots.csv (Season 2011)...
 Original rows: 199761
 Cleaned rows: 199761
 Actions: 0
Processing NBA_2005_Shots.csv (Season 2005)...
 Original rows: 197626
 Cleaned rows: 197612
 Actions: 1
Processing NBA_2017_Shots.csv (Season 2017)...
 Original rows: 209929
 Cleaned rows: 209929
 Actions: 0
Processing NBA_2022_Shots.csv (Season 2022)...
 Original rows: 216722
 Cleaned rows: 216718
 Actions: 3
Processing NBA_2006_Shots.csv (Season 2006)...
 Original rows: 194314
 Cleaned rows: 194299
 Actions: 1
Processing NBA_2012_Shots.csv (Season 2012)...
 Original rows: 161205
 Cleaned rows: 161205
 Actions: 0
Processing NBA_2019_Shots.csv (Season 2019)...
 Original rows: 219458
 Cleaned rows: 219443
 Actions: 3
Processing NBA_2021_Shots.csv (Season 2021)...
 Original rows: 190983
 Cleaned rows: 190971
 Actions: 3
Processing NBA_2014_Shots.csv (Season 2014)...
 Original rows: 204126
 Cleaned rows: 204125
 Actions: 3
Processing NBA_2013_Shots.csv (Season 2013)...
 Original rows: 201579
 Cleaned rows: 201579
 Actions: 2
Processing NBA_2007_Shots.csv (Season 2007)...
 Original rows: 196072
 Cleaned rows: 196054
 Actions: 1
Processing NBA_2018_Shots.csv (Season 2018)...
 Original rows: 211707
 Cleaned rows: 211693
 Actions: 3
Processing NBA_2020_Shots.csv (Season 2020)...
 Original rows: 188116
 Cleaned rows: 188100
 Actions: 3
Processing NBA_2015_Shots.csv (Season 2015)...
 Original rows: 205550
 Cleaned rows: 205550
 Actions: 2

Combined cleaned shot data: 4231070 rows
Saved cleaned shot data to ../data/interim/shots_clean.csv
| | season | original_rows | cleaned_rows | missing_cols | duplicates | outlier_cols |
|---|---|---|---|---|---|---|
| 0 | 2009 | 199030 | 199011 | 0 | 19 | 4 |
| 1 | 2004 | 189803 | 189788 | 0 | 15 | 3 |
| 2 | 2010 | 200966 | 200955 | 0 | 11 | 4 |
| 3 | 2016 | 207893 | 207893 | 0 | 0 | 5 |
| 4 | 2023 | 217220 | 217207 | 2 | 13 | 3 |
| 5 | 2008 | 200501 | 200490 | 0 | 11 | 3 |
| 6 | 2024 | 218701 | 218687 | 2 | 14 | 4 |
| 7 | 2011 | 199761 | 199761 | 0 | 0 | 4 |
| 8 | 2005 | 197626 | 197612 | 0 | 14 | 3 |
| 9 | 2017 | 209929 | 209929 | 0 | 0 | 5 |
| 10 | 2022 | 216722 | 216718 | 2 | 4 | 4 |
| 11 | 2006 | 194314 | 194299 | 0 | 15 | 5 |
| 12 | 2012 | 161205 | 161205 | 0 | 0 | 4 |
| 13 | 2019 | 219458 | 219443 | 2 | 15 | 4 |
| 14 | 2021 | 190983 | 190971 | 2 | 12 | 3 |
| 15 | 2014 | 204126 | 204125 | 2 | 1 | 4 |
| 16 | 2013 | 201579 | 201579 | 2 | 0 | 4 |
| 17 | 2007 | 196072 | 196054 | 0 | 18 | 5 |
| 18 | 2018 | 211707 | 211693 | 2 | 14 | 4 |
| 19 | 2020 | 188116 | 188100 | 2 | 16 | 3 |
| 20 | 2015 | 205550 | 205550 | 2 | 0 | 5 |
Player Data Validation¶
player_data_path = raw_dir / 'team_stats' / 'Player Totals.csv'
if not player_data_path.exists():
    print("Player data file not found")
else:
    print(f"Processing player data from {player_data_path.name}...")
    player_data = pd.read_csv(player_data_path)
    # Run the same checks used for the shot data
    validator = DataValidator("Player Data")
    validator.check_missing(player_data)
    validator.check_duplicates(player_data)
    validator.check_outliers(player_data)
    player_data_clean = validator.clean(player_data)
    print(f" Original rows: {len(player_data)}")
    print(f" Cleaned rows: {len(player_data_clean)}")
    print(f" Missing columns: {len(validator.issues.get('missing', {}))}")
    print(f" Duplicates: {validator.issues.get('duplicates', 0)}")
    print(f" Outlier columns: {len(validator.issues.get('outliers', {}))}")
    print(f" Actions: {len(validator.actions)}")
    player_data_clean.to_csv(interim_dir / 'player_data_clean.csv', index=False)
    print(f"Saved cleaned player data to {interim_dir / 'player_data_clean.csv'}")
Processing player data from Player Totals.csv...
 Original rows: 32538
 Cleaned rows: 32538
 Missing columns: 19
 Duplicates: 0
 Outlier columns: 25
 Actions: 9
Saved cleaned player data to ../data/interim/player_data_clean.csv
Team Stats Validation¶
team_stats_path = raw_dir / 'team_stats'
team_stats_files = list(team_stats_path.glob('*.csv'))
if not team_stats_files:
    print("No team stats files found")
else:
    print(f"Found {len(team_stats_files)} team stats files")
    for file in team_stats_files:
        # Only the team-level files are validated here
        if 'Team Stats' in file.name:
            print(f"Processing {file.name}...")
            team_stats = pd.read_csv(file)
            validator = DataValidator("Team Stats")
            validator.check_missing(team_stats)
            validator.check_duplicates(team_stats)
            validator.check_outliers(team_stats)
            team_stats_clean = validator.clean(team_stats)
            print(f" Original rows: {len(team_stats)}")
            print(f" Cleaned rows: {len(team_stats_clean)}")
            print(f" Missing columns: {len(validator.issues.get('missing', {}))}")
            print(f" Duplicates: {validator.issues.get('duplicates', 0)}")
            print(f" Outlier columns: {len(validator.issues.get('outliers', {}))}")
            print(f" Actions: {len(validator.actions)}")
            # Note: every matching file is written to the same path, so the file
            # processed last overwrites the earlier ones
            team_stats_clean.to_csv(interim_dir / 'team_stats_clean.csv', index=False)
            print(f"Saved cleaned team stats to {interim_dir / 'team_stats_clean.csv'}")
Found 22 team stats files
Processing Team Stats Per 100 Poss.csv...
 Original rows: 1432
 Cleaned rows: 1432
 Missing columns: 3
 Duplicates: 0
 Outlier columns: 20
 Actions: 0
Saved cleaned team stats to ../data/interim/team_stats_clean.csv
Processing Team Stats Per Game.csv...
 Original rows: 1876
 Cleaned rows: 1876
 Missing columns: 24
 Duplicates: 0
 Outlier columns: 21
 Actions: 15
Saved cleaned team stats to ../data/interim/team_stats_clean.csv
Data Quality Assessment¶
Let's visualize the data quality issues we've identified to better understand the cleaning needs of our datasets. The plot in this section shows how many duplicate records were found in each season's shot file.
Data Quality Visualization¶
# Visualize data quality metrics
if 'summary_df' in locals():
    # Plot duplicates by season
    plt.figure(figsize=(10, 5))
    sns.barplot(x='season', y='duplicates', data=summary_df)
    plt.title('Duplicate Records by Season')
    plt.xlabel('Season')
    plt.ylabel('Number of Duplicates')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
Data Cleaning Summary¶
Our data cleaning process has successfully addressed several key quality issues:
Missing Values: We identified columns with missing values and applied targeted handling strategies. For columns with less than 5% missing values, we filled numeric values with the median and categorical values with the mode. For columns with more significant missing data, we preserved the missing values to avoid introducing bias.
Duplicate Records: We removed duplicate records across all datasets, which is particularly important for shot data where duplicates could skew our analysis of shooting patterns.
Data Type Inconsistencies: We standardized data types across all datasets, ensuring that numeric columns are properly formatted for mathematical operations and analysis.
Outliers: We used the IQR method to identify potential outliers in numeric columns. Rather than automatically removing these outliers, we've flagged them for further analysis, recognizing that in basketball data, outliers may represent legitimate but rare events (like half-court shots).
Dataset Integration: We've combined and saved cleaned datasets in a standardized format, preparing them for the next stage of our pipeline.
The cleaning process has preserved the integrity of our data while addressing quality issues that could impact our analysis. By taking a conservative approach to data cleaning, we've maintained as much of the original information as possible while ensuring consistency and reliability.
Next, we'll tackle data standardization to ensure consistency across datasets, particularly for team names, player information, and coordinate systems.
DeepShot: Data Standardization¶
Introduction¶
Data standardization is a crucial step in our NBA shot analysis pipeline. Basketball data presents unique standardization challenges due to:
- Team Name Variations: Teams are referred to by city names, nicknames, abbreviations, and full names across different datasets
- Historical Team Changes: Franchises relocate and change names over time (e.g., Seattle SuperSonics → Oklahoma City Thunder)
- Player Name Inconsistencies: Player names may appear in different formats or with different spellings
- Coordinate System Differences: Shot locations may use different coordinate systems or units
- Temporal Information Formats: Dates and times may be represented in various formats
In this notebook, we standardize these elements across our datasets to ensure consistent analysis. Without standardization, we would encounter issues when joining datasets, calculating statistics, or building models. For example, "GSW," "Golden State," and "Warriors" would be treated as different teams without proper standardization.
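The "GSW" example can be sketched with a plain dictionary lookup. The excerpt below is illustrative only: `TEAM_ALIASES` and `standardize_team` are hypothetical names, and falling back to the cleaned input for unknown names is an assumption about how unmapped values should be handled.

```python
# Illustrative excerpt of an alias-to-canonical-name mapping
TEAM_ALIASES = {
    "GSW": "Golden State Warriors",
    "Golden State": "Golden State Warriors",
    "Warriors": "Golden State Warriors",
    "Golden State Warriors": "Golden State Warriors",
    "SEA": "Oklahoma City Thunder",
    "Seattle SuperSonics": "Oklahoma City Thunder",
}

def standardize_team(name, aliases=TEAM_ALIASES):
    """Return the canonical team name, or the stripped input if no alias is known."""
    cleaned = name.strip()
    return aliases.get(cleaned, cleaned)

raw_names = ["GSW", " Golden State ", "Warriors", "Seattle SuperSonics", "Unknown FC"]
standardized = [standardize_team(n) for n in raw_names]
print(standardized)
print(len(set(standardized[:3])))  # → 1
```

Leaving unknown names untouched makes gaps in the mapping easy to spot later, since any value that is not a canonical name stands out in a frequency count.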
Our standardization approach focuses on:
- Creating comprehensive mapping dictionaries for team names
- Adding conference designations for conference-based analysis
- Converting percentage values to decimal format
- Standardizing coordinate systems and calculating derived spatial features
- Normalizing player names to a consistent format
- Standardizing temporal information and adding season designations
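Three of these steps are simple enough to sketch with the standard library. The helpers below are hypothetical and rest on stated assumptions: seasons are taken to start in October (irregular seasons would need overrides), dates use the `MM-DD-YYYY` format seen in the shot sample, and the hoop is placed at (0, 5.25) because the sampled shots with `SHOT_DISTANCE` 0 sit at that location.

```python
import math
from datetime import datetime

def pct_to_decimal(value):
    """Convert a percentage such as 46.3 to the decimal 0.463."""
    return value / 100.0

def season_label(game_date, fmt="%m-%d-%Y", start_month=10):
    """Map a game date to a label like '2008-09', assuming an October season start."""
    d = datetime.strptime(game_date, fmt)
    start_year = d.year if d.month >= start_month else d.year - 1
    return f"{start_year}-{(start_year + 1) % 100:02d}"

def distance_from_hoop(loc_x, loc_y, hoop=(0.0, 5.25)):
    """Euclidean distance from a shot location to the assumed hoop position."""
    return math.hypot(loc_x - hoop[0], loc_y - hoop[1])

print(season_label("04-15-2009"))     # → 2008-09
print(season_label("11-01-1999"))     # → 1999-00
print(distance_from_hoop(3.0, 9.25))  # → 5.0
```

Only columns known to hold percentages should be passed through `pct_to_decimal`; the team stats sample above already stores `fg_percent` as a decimal.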
# ##HIDE##
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
data_dir = Path('../data')
raw_dir = data_dir / 'raw'
interim_dir = data_dir / 'interim'
processed_dir = data_dir / 'processed'
processed_dir.mkdir(parents=True, exist_ok=True)
# Load cleaned data
shots_path = interim_dir / 'shots_clean.csv'
team_stats_path = interim_dir / 'team_stats_clean.csv'
if shots_path.exists():
    shots = pd.read_csv(shots_path)
    print(f"Loaded {len(shots)} shot records")
else:
    raise FileNotFoundError(f"Required data file not found: {shots_path}")
if team_stats_path.exists():
    team_stats = pd.read_csv(team_stats_path)
    print(f"Loaded {len(team_stats)} team stat records")
else:
    raise FileNotFoundError(f"Required data file not found: {team_stats_path}")
player_data_path = interim_dir / 'player_data_clean.csv'
if player_data_path.exists():
    player_data = pd.read_csv(player_data_path)
    # Rename 'tm' to 'team' for consistency
    if 'tm' in player_data.columns:
        player_data.rename(columns={'tm': 'team'}, inplace=True)
        print("Renamed 'tm' column to 'team' in player data for consistency")
    print(f"Loaded {len(player_data)} player records")
else:
    raise FileNotFoundError(f"Required data file not found: {player_data_path}")
Loaded 4231070 shot records
Loaded 1876 team stat records
Renamed 'tm' column to 'team' in player data for consistency
Loaded 32538 player records
Team Name Standardization¶
# Create team name mapping
team_mapping = {
# Washington Wizards and historical names
'Washington': 'Washington Wizards',
'WAS': 'Washington Wizards',
'Wizards': 'Washington Wizards',
'Washington Wizards': 'Washington Wizards',
'Bullets': 'Washington Wizards',
'Washington Bullets': 'Washington Wizards',
'Capital Bullets': 'Washington Wizards',
'Baltimore Bullets': 'Washington Wizards',
'Chicago Zephyrs': 'Washington Wizards',
'Chicago Packers': 'Washington Wizards',
# Atlanta Hawks and historical names
'Atlanta': 'Atlanta Hawks',
'ATL': 'Atlanta Hawks',
'Hawks': 'Atlanta Hawks',
'Atlanta Hawks': 'Atlanta Hawks',
'St. Louis Hawks': 'Atlanta Hawks',
'Milwaukee Hawks': 'Atlanta Hawks',
'Tri-Cities Blackhawks': 'Atlanta Hawks',
# Los Angeles Clippers and historical names
'LA Clippers': 'Los Angeles Clippers',
'LAC': 'Los Angeles Clippers',
'Clippers': 'Los Angeles Clippers',
'Los Angeles Clippers': 'Los Angeles Clippers',
'Buffalo Braves': 'Los Angeles Clippers',
'San Diego Clippers': 'Los Angeles Clippers',
# Sacramento Kings and historical names
'Sacramento': 'Sacramento Kings',
'SAC': 'Sacramento Kings',
'Kings': 'Sacramento Kings',
'Sacramento Kings': 'Sacramento Kings',
'Kansas City Kings': 'Sacramento Kings',
'Cincinnati Royals': 'Sacramento Kings',
'Rochester Royals': 'Sacramento Kings',
# Philadelphia 76ers and historical names
'Philadelphia': 'Philadelphia 76ers',
'PHI': 'Philadelphia 76ers',
'76ers': 'Philadelphia 76ers',
'Sixers': 'Philadelphia 76ers',
'Philadelphia 76ers': 'Philadelphia 76ers',
'Syracuse Nationals': 'Philadelphia 76ers',
# Los Angeles Lakers and historical names
'LA Lakers': 'Los Angeles Lakers',
'LAL': 'Los Angeles Lakers',
'Lakers': 'Los Angeles Lakers',
'Los Angeles Lakers': 'Los Angeles Lakers',
'Minneapolis Lakers': 'Los Angeles Lakers',
# Houston Rockets and historical names
'Houston': 'Houston Rockets',
'HOU': 'Houston Rockets',
'Rockets': 'Houston Rockets',
'Houston Rockets': 'Houston Rockets',
'San Diego Rockets': 'Houston Rockets',
# Oklahoma City Thunder and historical names
'Oklahoma City': 'Oklahoma City Thunder',
'OKC': 'Oklahoma City Thunder',
'Thunder': 'Oklahoma City Thunder',
'Oklahoma City Thunder': 'Oklahoma City Thunder',
'Seattle SuperSonics': 'Oklahoma City Thunder',
'Sonics': 'Oklahoma City Thunder',
'SEA': 'Oklahoma City Thunder',
# Memphis Grizzlies and historical names
'Memphis': 'Memphis Grizzlies',
'MEM': 'Memphis Grizzlies',
'Grizzlies': 'Memphis Grizzlies',
'Memphis Grizzlies': 'Memphis Grizzlies',
'Vancouver Grizzlies': 'Memphis Grizzlies',
# New Orleans Pelicans and historical names
'New Orleans': 'New Orleans Pelicans',
'NOP': 'New Orleans Pelicans',
'Pelicans': 'New Orleans Pelicans',
'New Orleans Pelicans': 'New Orleans Pelicans',
'New Orleans Hornets': 'New Orleans Pelicans',
'New Orleans/Oklahoma City Hornets': 'New Orleans Pelicans',
'NOK': 'New Orleans Pelicans',
'NOH': 'New Orleans Pelicans',
# Utah Jazz and historical names
'Utah': 'Utah Jazz',
'UTA': 'Utah Jazz',
'Jazz': 'Utah Jazz',
'Utah Jazz': 'Utah Jazz',
'New Orleans Jazz': 'Utah Jazz',
# Charlotte Hornets and historical names
'Charlotte': 'Charlotte Hornets',
'CHA': 'Charlotte Hornets',
'Hornets': 'Charlotte Hornets',
'Charlotte Hornets': 'Charlotte Hornets',
'Charlotte Bobcats': 'Charlotte Hornets',
'Bobcats': 'Charlotte Hornets',
'CHO': 'Charlotte Hornets',
# Brooklyn Nets and historical names
'Brooklyn': 'Brooklyn Nets',
'BKN': 'Brooklyn Nets',
'Nets': 'Brooklyn Nets',
'Brooklyn Nets': 'Brooklyn Nets',
'New Jersey Nets': 'Brooklyn Nets',
'NJN': 'Brooklyn Nets',
'BRK': 'Brooklyn Nets',
# Golden State Warriors and historical names
'Golden State': 'Golden State Warriors',
'GSW': 'Golden State Warriors',
'Warriors': 'Golden State Warriors',
'Golden State Warriors': 'Golden State Warriors',
'San Francisco Warriors': 'Golden State Warriors',
# Phoenix Suns
'Phoenix': 'Phoenix Suns',
'PHX': 'Phoenix Suns',
'PHO': 'Phoenix Suns',
'Suns': 'Phoenix Suns',
'Phoenix Suns': 'Phoenix Suns',
# Portland Trail Blazers
'Portland': 'Portland Trail Blazers',
'POR': 'Portland Trail Blazers',
'Blazers': 'Portland Trail Blazers',
'Trail Blazers': 'Portland Trail Blazers',
'Portland Trail Blazers': 'Portland Trail Blazers',
# San Antonio Spurs
'San Antonio': 'San Antonio Spurs',
'SAS': 'San Antonio Spurs',
'Spurs': 'San Antonio Spurs',
'San Antonio Spurs': 'San Antonio Spurs',
# Toronto Raptors
'Toronto': 'Toronto Raptors',
'TOR': 'Toronto Raptors',
'Raptors': 'Toronto Raptors',
'Toronto Raptors': 'Toronto Raptors',
# Milwaukee Bucks
'Milwaukee': 'Milwaukee Bucks',
'MIL': 'Milwaukee Bucks',
'Bucks': 'Milwaukee Bucks',
'Milwaukee Bucks': 'Milwaukee Bucks',
# Minnesota Timberwolves
'Minnesota': 'Minnesota Timberwolves',
'MIN': 'Minnesota Timberwolves',
'Timberwolves': 'Minnesota Timberwolves',
'Wolves': 'Minnesota Timberwolves',
'Minnesota Timberwolves': 'Minnesota Timberwolves',
# Denver Nuggets
'Denver': 'Denver Nuggets',
'DEN': 'Denver Nuggets',
'Nuggets': 'Denver Nuggets',
'Denver Nuggets': 'Denver Nuggets',
# Miami Heat
'Miami': 'Miami Heat',
'MIA': 'Miami Heat',
'Heat': 'Miami Heat',
'Miami Heat': 'Miami Heat',
# Cleveland Cavaliers
'Cleveland': 'Cleveland Cavaliers',
'CLE': 'Cleveland Cavaliers',
'Cavaliers': 'Cleveland Cavaliers',
'Cavs': 'Cleveland Cavaliers',
'Cleveland Cavaliers': 'Cleveland Cavaliers',
# Boston Celtics
'Boston': 'Boston Celtics',
'BOS': 'Boston Celtics',
'Celtics': 'Boston Celtics',
'Boston Celtics': 'Boston Celtics',
# Detroit Pistons
'Detroit': 'Detroit Pistons',
'DET': 'Detroit Pistons',
'Pistons': 'Detroit Pistons',
'Detroit Pistons': 'Detroit Pistons',
# Indiana Pacers
'Indiana': 'Indiana Pacers',
'IND': 'Indiana Pacers',
'Pacers': 'Indiana Pacers',
'Indiana Pacers': 'Indiana Pacers',
# Chicago Bulls
'Chicago': 'Chicago Bulls',
'CHI': 'Chicago Bulls',
'Bulls': 'Chicago Bulls',
'Chicago Bulls': 'Chicago Bulls',
# Dallas Mavericks
'Dallas': 'Dallas Mavericks',
'DAL': 'Dallas Mavericks',
'Mavericks': 'Dallas Mavericks',
'Mavs': 'Dallas Mavericks',
'Dallas Mavericks': 'Dallas Mavericks',
# Orlando Magic
'Orlando': 'Orlando Magic',
'ORL': 'Orlando Magic',
'Magic': 'Orlando Magic',
'Orlando Magic': 'Orlando Magic',
# New York Knicks
'New York': 'New York Knicks',
'NYK': 'New York Knicks',
'Knicks': 'New York Knicks',
'New York Knicks': 'New York Knicks'
}
# Make sure every canonical full name maps to itself (already true for the keys above)
for team in set(team_mapping.values()):
    team_mapping[team] = team
print(f"Created team mapping with {len(team_mapping)} entries")
print(f"Standardized to {len(set(team_mapping.values()))} unique team names")
Created team mapping with 159 entries
Standardized to 30 unique team names
def standardize_team_names(df):
    df_std = df.copy()
    # Find team name columns
    team_cols = [col for col in df.columns if 'team' in col.lower()]
    # Apply mapping; values with no entry in team_mapping are left unchanged
    for col in team_cols:
        df_std[col] = df_std[col].map(lambda x: team_mapping.get(str(x), x))
    return df_std

# Standardize team names
shots_std = standardize_team_names(shots)
team_stats_std = standardize_team_names(team_stats)
player_data_std = standardize_team_names(player_data)
# Remove League Average entries from team_stats
team_stats_std = team_stats_std[team_stats_std['team'] != 'League Average']
print(f"Removed {len(team_stats) - len(team_stats_std)} 'League Average' entries")
# Remove 'TOT' rows (season totals for players who appeared for more than one team)
player_data_std = player_data_std[player_data_std['team'] != 'TOT']
print(f"Removed {len(player_data) - len(player_data_std)} 'TOT' entries from player data")
# Check results
for df_name, df in [('shots', shots_std), ('team_stats', team_stats_std), ('player_data', player_data_std)]:
    team_cols = [col for col in df.columns if 'team' in col.lower()]
    for col in team_cols:
        print(f"{df_name}.{col}: {df[col].nunique()} unique values")
Removed 88 'League Average' entries
Removed 2962 'TOT' entries from player data
shots.TEAM_ID: 30 unique values
shots.TEAM_NAME: 30 unique values
shots.HOME_TEAM: 30 unique values
shots.AWAY_TEAM: 30 unique values
team_stats.team: 72 unique values
player_data.team: 99 unique values
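The counts above show 72 and 99 unique values in team_stats.team and player_data.team, so some values are passing through the mapping unchanged. A minimal sketch of how one might list them follows. The helper name find_unmapped and the miniature mapping are hypothetical and only illustrate the idea; they are not part of the notebook.

```python
from collections import Counter


def find_unmapped(values, mapping):
    """Count values that have no key in `mapping` (these pass through unchanged)."""
    return Counter(str(v) for v in values if str(v) not in mapping)


# Hypothetical miniature mapping and column values, for illustration only
demo_mapping = {
    'WAS': 'Washington Wizards',
    'Washington Wizards': 'Washington Wizards',
    'SEA': 'Oklahoma City Thunder',
}
demo_values = ['WAS', 'SEA', 'WSB', 'WSB', 'Washington Wizards', 'KCK']

unmapped = find_unmapped(demo_values, demo_mapping)
print(unmapped.most_common())
```

With the notebook's own objects, a call along the lines of find_unmapped(team_stats_std['team'], team_mapping) would show which historical names or abbreviations still need an entry.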
Conference Mapping¶
eastern_conf_teams = [
'Atlanta Hawks',
'Boston Celtics',
'Brooklyn Nets',
'Charlotte Hornets',
'Chicago Bulls',
'Cleveland Cavaliers',
'Detroit Pistons',
'Indiana Pacers',
'Miami Heat',
'Milwaukee Bucks',
'New York Knicks',
'Orlando Magic',
'Philadelphia 76ers',
'Toronto Raptors',
'Washington Wizards',
'New York Nets', # ABA Eastern Division
'Kentucky Colonels', # ABA Eastern Division
'Spirits of St. Louis', # ABA Eastern Division
'Virginia Squires', # ABA Eastern Division
'Carolina Cougars', # ABA Eastern Division
'Memphis Tams', # ABA Eastern Division
'The Floridians', # ABA Eastern Division
'Memphis Pros', # ABA Eastern Division
'Pittsburgh Condors', # ABA Eastern Division
'Miami Floridians', # ABA Eastern Division
'Pittsburgh Pipers', # ABA Eastern Division
'Washington Capitols', # BAA/NBA Eastern Division
'Minnesota Pipers', # ABA Eastern Division
'New Jersey Americans', # ABA Eastern Division
'Philadelphia Warriors', # NBA Eastern Division
'Fort Wayne Pistons', # NBA Eastern Division (before moving to Detroit)
'Indianapolis Olympians', # NBA Western Division (but geographically eastern)
'Anderson Packers', # NBA Eastern Division
'Chicago Stags', # BAA/NBA Eastern Division
'Sheboygan Red Skins', # NBA Western Division (but later Eastern)
'Indianapolis Jets', # BAA Eastern Division
'Providence Steamrollers', # BAA Eastern Division
'Cleveland Rebels', # BAA Eastern Division
'Detroit Falcons', # BAA Eastern Division
'Pittsburgh Ironmen', # BAA Eastern Division
'Toronto Huskies', # BAA Eastern Division
'ATL', # Atlanta Hawks
'BOS', # Boston Celtics
'BKN', # Brooklyn Nets
'CHA', # Charlotte Hornets
'CHI', # Chicago Bulls
'CLE', # Cleveland Cavaliers
'DET', # Detroit Pistons
'IND', # Indiana Pacers
'MIA', # Miami Heat
'MIL', # Milwaukee Bucks
'NYK', # New York Knicks
'ORL', # Orlando Magic
'PHI', # Philadelphia 76ers
'TOR', # Toronto Raptors
'WAS', # Washington Wizards
# Historical Eastern Conference teams
'CHH', # Charlotte Hornets (original)
'WSB', # Washington Bullets
'BUF', # Buffalo Braves
'NYN', # New York Nets (ABA)
'NYA', # New York Americans
'KEN', # Kentucky Colonels (ABA)
'SSL', # Spirits of St. Louis (ABA)
'INA', # Indiana Pacers (ABA)
'VIR', # Virginia Squires (ABA)
'CAP', # Capital Bullets
'CAR', # Carolina Cougars (ABA)
'BAL', # Baltimore Bullets
'FLO', # The Floridians (ABA)
'PTC', # Pittsburgh Condors (ABA)
'CIN', # Cincinnati Royals
'MMF', # Miami Floridians (ABA)
'PTP', # Pittsburgh Pipers (ABA)
'WSA', # Washington Capitols
'MNP', # Minnesota Pipers (ABA)
'NJA', # New Jersey Americans (ABA)
'SYR', # Syracuse Nationals
'CHZ', # Chicago Zephyrs
'PHW', # Philadelphia Warriors
'CHP', # Chicago Packers
'FTW', # Fort Wayne Pistons
'ROC', # Rochester Royals
'BLB', # Baltimore Bullets (original)
'INO', # Indianapolis Olympians
'WSC', # Washington Capitols
'CHS', # Chicago Stags
'AND', # Anderson Packers
'INJ', # Indianapolis Jets
'PRO', # Providence Steamrollers
'DTF', # Detroit Falcons
'CLR', # Cleveland Rebels
'TRH', # Toronto Huskies
'PIT', # Pittsburgh Ironmen
]
western_conf_teams = [
'Dallas Mavericks',
'Denver Nuggets',
'Golden State Warriors',
'Houston Rockets',
'Los Angeles Clippers',
'Los Angeles Lakers',
'Memphis Grizzlies',
'Minnesota Timberwolves',
'New Orleans Pelicans',
'Oklahoma City Thunder',
'Phoenix Suns',
'Portland Trail Blazers',
'Sacramento Kings',
'San Antonio Spurs',
'Utah Jazz',
'San Diego Sails', # ABA Western Division
'Utah Stars', # ABA Western Division
'Kansas City-Omaha Kings', # NBA Midwest Division (Western Conference)
'Memphis Sounds', # ABA Western Division
'San Diego Conquistadors', # ABA Western Division
'Denver Rockets', # ABA Western Division
'Dallas Chaparrals', # ABA Western Division
'Texas Chaparrals', # ABA Western Division
'Los Angeles Stars', # ABA Western Division
'New Orleans Buccaneers', # ABA Western Division
'Houston Mavericks', # ABA Western Division
'Oakland Oaks', # ABA Western Division
'Anaheim Amigos', # ABA Western Division
'Minnesota Muskies', # ABA Western Division
'St. Louis Bombers', # BAA/NBA Western Division
'Waterloo Hawks', # NBA Western Division
'DAL', # Dallas Mavericks
'DEN', # Denver Nuggets
'GSW', # Golden State Warriors
'HOU', # Houston Rockets
'LAC', # Los Angeles Clippers
'LAL', # Los Angeles Lakers
'MEM', # Memphis Grizzlies
'MIN', # Minnesota Timberwolves
'NOP', # New Orleans Pelicans
'OKC', # Oklahoma City Thunder
'PHX', # Phoenix Suns
'POR', # Portland Trail Blazers
'SAC', # Sacramento Kings
'SAS', # San Antonio Spurs
'UTA', # Utah Jazz
# Historical Western Conference teams
'VAN', # Vancouver Grizzlies
'KCK', # Kansas City Kings
'SDC', # San Diego Clippers
'NOJ', # New Orleans Jazz
'UTS', # Utah Stars (ABA)
'SAA', # San Antonio Spurs (ABA)
'SDS', # San Diego Sails (ABA)
'DNA', # Denver Nuggets (ABA)
'SDA', # San Diego Conquistadors (ABA)
'MMS', # Memphis Sounds (ABA)
'KCO', # Kansas City-Omaha Kings
'DNR', # Denver Rockets (ABA)
'MMT', # Memphis Tams (ABA)
'DLC', # Dallas Chaparrals (ABA)
'MMP', # Memphis Pros (ABA)
'SFW', # San Francisco Warriors
'SDR', # San Diego Rockets
'TEX', # Texas Chaparrals (ABA)
'LAS', # Los Angeles Stars (ABA)
'NOB', # New Orleans Buccaneers (ABA)
'OAK', # Oakland Oaks (ABA)
'HSM', # Houston Mavericks (ABA)
'ANA', # Anaheim Amigos (ABA)
'STL', # St. Louis Hawks/Bombers
'MNM', # Minnesota Muskies (ABA)
'MNL', # Minneapolis Lakers
'MLH', # Milwaukee Hawks
'TRI', # Tri-Cities Blackhawks
'DNN', # Denver Nuggets (original)
'WAT', # Waterloo Hawks
'STB', # St. Louis Bombers
'SHE', # Sheboygan Red Skins
]
def add_conference_mappings(df, team_col=None):
    df_conf = df.copy()
    # Convert conference team lists to uppercase sets for case-insensitive lookup
    eastern_conf_upper = {team.upper() for team in eastern_conf_teams}
    western_conf_upper = {team.upper() for team in western_conf_teams}
    # Find team name columns if not specified
    if team_col is None:
        team_cols = [col for col in df.columns if 'team' in col.lower()]
        if len(team_cols) > 0:
            team_col = team_cols[0]  # Use the first team column found
        else:
            print("No team column found")
            return df_conf
    # Make sure the column exists
    if team_col in df.columns:
        # Add conference column (East is checked first)
        df_conf['conference'] = df_conf[team_col].apply(
            lambda x: 'EAST' if str(x).upper() in eastern_conf_upper
            else 'WEST' if str(x).upper() in western_conf_upper
            else 'Unknown'
        )
        # Print conference distribution
        print("Conference distribution:")
        print(df_conf['conference'].value_counts())
    return df_conf

# Add conference information
shots_std = add_conference_mappings(shots_std, 'TEAM_NAME')
team_stats_std = add_conference_mappings(team_stats_std, 'team')
player_data_std = add_conference_mappings(player_data_std, 'team')  # Add player data with 'team' column
# Check results
for df_name, df in [('shots', shots_std), ('team_stats', team_stats_std), ('player_data', player_data_std)]:
    if 'conference' in df.columns:
        print(f"{df_name} conference distribution:")
        print(df['conference'].value_counts())
Conference distribution:
conference
WEST    2128199
EAST    2102871
Name: count, dtype: int64
Conference distribution:
conference
EAST    959
WEST    829
Name: count, dtype: int64
Conference distribution:
conference
EAST    15975
WEST    13601
Name: count, dtype: int64
shots conference distribution:
conference
WEST    2128199
EAST    2102871
Name: count, dtype: int64
team_stats conference distribution:
conference
EAST    959
WEST    829
Name: count, dtype: int64
player_data conference distribution:
conference
EAST    15975
WEST    13601
Name: count, dtype: int64
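Because the conference assignment checks the eastern list first, a name that ended up in both lists would silently be labelled EAST. The sketch below shows one possible way to build a single lookup table and report such overlaps. The function names and the shortened sample lists are hypothetical, and the overlap in the sample is constructed on purpose.

```python
def build_conference_lookup(east, west):
    """Return (lookup, overlaps): upper-cased name -> conference, plus names found in both lists."""
    east_upper = {name.upper() for name in east}
    west_upper = {name.upper() for name in west}
    overlaps = sorted(east_upper & west_upper)
    lookup = {name: 'WEST' for name in west_upper}
    # East overwrites West on overlap, mirroring an East-first conditional
    lookup.update({name: 'EAST' for name in east_upper})
    return lookup, overlaps


def conference_of(name, lookup):
    """Case-insensitive conference lookup with an 'Unknown' fallback."""
    return lookup.get(str(name).upper(), 'Unknown')


# Hypothetical, shortened samples; 'Memphis Tams' is duplicated deliberately
east_sample = ['Atlanta Hawks', 'BOS', 'Memphis Tams']
west_sample = ['Utah Jazz', 'DAL', 'memphis tams']

lookup, overlaps = build_conference_lookup(east_sample, west_sample)
print(overlaps)
print(conference_of('bos', lookup), conference_of('Utah Jazz', lookup), conference_of('XYZ', lookup))
```

A dictionary lookup like this can be passed to a pandas Series.map call, and the overlaps list gives a quick sanity check whenever the team lists are edited.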
Percentage Conversion¶
def convert_percentages(df):
    df_pct = df.copy()
    # Find percentage columns - add 'percent' to the search terms
    pct_cols = [col for col in df.columns if any(x in col.lower() for x in ['percentage', 'pct', 'percent'])]
    if len(pct_cols) > 0:
        print(f"Found {len(pct_cols)} percentage columns: {pct_cols}")
        # Convert each percentage column
        for col in pct_cols:
            if df_pct[col].dtype == 'object':  # Only convert string columns
                # Remove % sign and convert to float, then divide by 100
                df_pct[col] = df_pct[col].str.rstrip('%').astype('float') / 100.0
                print(f"Converted {col} to decimal values")
    else:
        print("No percentage columns found")
    return df_pct

team_stats_std = convert_percentages(team_stats_std)
player_data_std = convert_percentages(player_data_std)  # Add player data conversion
# Summarize percentage columns.
# NOTE: this check searches for 'percentage' and 'pct' only, so columns named
# '*_percent' (the ones found above) are not summarized here.
for df_name, df in [('shots', shots_std), ('team_stats', team_stats_std), ('player_data', player_data_std)]:
    pct_cols = [col for col in df.columns if any(x in col.lower() for x in ['percentage', 'pct'])]
    for col in pct_cols:
        if col in df.columns:
            print(f"{df_name}.{col} statistics:")
            print(df[col].describe())
Found 4 percentage columns: ['fg_percent', 'x3p_percent', 'x2p_percent', 'ft_percent']
Found 5 percentage columns: ['fg_percent', 'x3p_percent', 'x2p_percent', 'e_fg_percent', 'ft_percent']
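The percentage columns were found, but no "Converted" message was printed, which suggests they were already numeric in this run. For data where the values do arrive as strings, the conversion rule can be sketched for a single value as below. The helper percent_to_fraction is a hypothetical name used only for illustration.

```python
def percent_to_fraction(value):
    """Convert a string such as '45.6%' to 0.456; pass non-string values through unchanged."""
    if isinstance(value, str):
        # Same rule as the column-wise version: drop a trailing '%', then divide by 100
        return float(value.strip().rstrip('%')) / 100.0
    return value


print(percent_to_fraction('45.6%'))
print(percent_to_fraction(0.456))
```

As in the column-wise version, a string without a percent sign is still divided by 100, so this rule is only safe when every string in the column is on the 0 to 100 scale.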
Numeric Column Handling¶
def handle_numeric_columns(df):
    df_num = df.copy()
    # Define patterns for columns that are likely not numeric.
    # NOTE: patterns are matched as substrings of the lowercased column name, so a short
    # pattern such as 'to' also excludes columns like 'tov'; those columns are skipped here.
    exclude_patterns = [
        # Shot data columns
        'name', 'team', 'position', 'date', 'season', 'location',
        'event_type', 'action_type', 'shot_type', 'basic_zone',
        'zone_abb', 'zone_range', 'zone_name', 'home_team', 'away_team',
        'position_group', 'season_1', 'season_2',  # Added specific season columns
        # Team stats columns
        'abbreviation', 'player', 'pos', 'lg', 'tm', 'experience',
        'birth_year', 'birth_date', 'college', 'slug', 'arena',
        'playoffs', 'winner', 'replaced', 'type', 'number_tm',
        'hof', 'from', 'to', 'ht_in_in', 'wt',
        # Player data columns
        'seas_id', 'player_id', 'player', 'pos', 'lg', 'conference'
        # Note: 'season', 'birth_year', and 'experience' are already in the list above
    ]
    # Filter columns that are likely to be numeric
    numeric_candidates = [col for col in df.columns
                          if not any(pattern in col.lower() for pattern in exclude_patterns)]
    if len(numeric_candidates) > 0:
        print(f"Found {len(numeric_candidates)} potential numeric columns")
        # Convert to numeric and handle missing values
        for col in numeric_candidates:
            try:
                # Convert to numeric first; unparseable values become NaN
                df_num[col] = pd.to_numeric(df_num[col], errors='coerce')
                # Handle missing values
                if df_num[col].isna().all():
                    # If all values are NaN, set the entire column to 0
                    df_num[col] = 0
                    print(f"Column {col}: All values were NaN, set to 0")
                elif df_num[col].isna().any():
                    # If some values are NaN, fill them with 0
                    missing_count = df_num[col].isna().sum()
                    df_num[col] = df_num[col].fillna(0)
                    print(f"Column {col}: Filled {missing_count} missing values with 0")
                else:
                    print(f"Column {col}: Successfully converted to numeric with no missing values")
            except Exception as e:
                print(f"Warning: Could not convert column {col} to numeric: {str(e)}")
    else:
        print("No potential numeric columns found")
    return df_num

shots_std = handle_numeric_columns(shots_std)
team_stats_std = handle_numeric_columns(team_stats_std)
player_data_std = handle_numeric_columns(player_data_std)  # Add player data handling
# Check results
for df_name, df in [('shots', shots_std), ('team_stats', team_stats_std), ('player_data', player_data_std)]:
    numeric_cols = df.select_dtypes(include=['number']).columns
    print(f"\n{df_name} has {len(numeric_cols)} numeric columns")
    if len(numeric_cols) > 0:
        print(f"Sample numeric columns: {list(numeric_cols)[:5]}")
Found 8 potential numeric columns
Column GAME_ID: Successfully converted to numeric with no missing values
Column SHOT_MADE: Successfully converted to numeric with no missing values
Column LOC_X: Successfully converted to numeric with no missing values
Column LOC_Y: Successfully converted to numeric with no missing values
Column SHOT_DISTANCE: Successfully converted to numeric with no missing values
Column QUARTER: Successfully converted to numeric with no missing values
Column MINS_LEFT: Successfully converted to numeric with no missing values
Column SECS_LEFT: Successfully converted to numeric with no missing values
Found 22 potential numeric columns
Column g: Successfully converted to numeric with no missing values
Column mp_per_game: Filled 172 missing values with 0
Column fg_per_game: Successfully converted to numeric with no missing values
Column fga_per_game: Successfully converted to numeric with no missing values
Column fg_percent: Successfully converted to numeric with no missing values
Column x3p_per_game: Filled 410 missing values with 0
Column x3pa_per_game: Filled 410 missing values with 0
Column x3p_percent: Filled 410 missing values with 0
Column x2p_per_game: Successfully converted to numeric with no missing values
Column x2pa_per_game: Successfully converted to numeric with no missing values
Column x2p_percent: Successfully converted to numeric with no missing values
Column ft_per_game: Successfully converted to numeric with no missing values
Column fta_per_game: Successfully converted to numeric with no missing values
Column ft_percent: Successfully converted to numeric with no missing values
Column orb_per_game: Filled 302 missing values with 0
Column drb_per_game: Filled 302 missing values with 0
Column trb_per_game: Successfully converted to numeric with no missing values
Column ast_per_game: Successfully converted to numeric with no missing values
Column stl_per_game: Filled 356 missing values with 0
Column blk_per_game: Filled 356 missing values with 0
Column pf_per_game: Successfully converted to numeric with no missing values
Column pts_per_game: Successfully converted to numeric with no missing values
Found 25 potential numeric columns
Column age: Successfully converted to numeric with no missing values
Column g: Successfully converted to numeric with no missing values
Column gs: Filled 7844 missing values with 0
Column mp: Successfully converted to numeric with no missing values
Column fg: Successfully converted to numeric with no missing values
Column fga: Successfully converted to numeric with no missing values
Column fg_percent: Successfully converted to numeric with no missing values
Column x3p: Filled 5795 missing values with 0
Column x3pa: Filled 5795 missing values with 0
Column x3p_percent: Filled 9686 missing values with 0
Column x2p: Successfully converted to numeric with no missing values
Column x2pa: Successfully converted to numeric with no missing values
Column x2p_percent: Successfully converted to numeric with no missing values
Column e_fg_percent: Successfully converted to numeric with no missing values
Column ft: Successfully converted to numeric with no missing values
Column fta: Successfully converted to numeric with no missing values
Column ft_percent: Successfully converted to numeric with no missing values
Column orb: Filled 4238 missing values with 0
Column drb: Filled 4238 missing values with 0
Column trb: Successfully converted to numeric with no missing values
Column ast: Successfully converted to numeric with no missing values
Column stl: Filled 5095 missing values with 0
Column blk: Filled 5094 missing values with 0
Column pf: Successfully converted to numeric with no missing values
Column pts: Successfully converted to numeric with no missing values

shots has 11 numeric columns
Sample numeric columns: ['SEASON_1', 'TEAM_ID', 'PLAYER_ID', 'GAME_ID', 'LOC_X']

team_stats has 24 numeric columns
Sample numeric columns: ['season', 'g', 'mp_per_game', 'fg_per_game', 'fga_per_game']

player_data has 31 numeric columns
Sample numeric columns: ['seas_id', 'season', 'player_id', 'birth_year', 'age']
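The exclusion patterns in handle_numeric_columns are matched as substrings, so a short pattern such as 'to' can exclude more columns than intended. The output above lists no tov column among the converted ones, which is consistent with that. The sketch below contrasts substring and whole-name matching on a small, hypothetical column list; numeric_candidates is an illustrative helper, not the notebook's function.

```python
def numeric_candidates(columns, exclude, exact=False):
    """Return the columns that survive exclusion, matching by substring or by whole name."""
    patterns = {pattern.lower() for pattern in exclude}
    if exact:
        return [c for c in columns if c.lower() not in patterns]
    return [c for c in columns if not any(p in c.lower() for p in patterns)]


# Hypothetical column names and exclusion patterns
columns = ['player', 'pos', 'tov', 'stl', 'pts', 'to']
exclude = ['player', 'pos', 'to']

substring_result = numeric_candidates(columns, exclude)
exact_result = numeric_candidates(columns, exclude, exact=True)
print(substring_result)  # 'tov' is dropped because it contains 'to'
print(exact_result)      # 'tov' is kept
```

Whole-name matching is stricter but needs every excluded column listed in full (for example both home_team and away_team), so which rule fits better depends on how stable the column names are.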
Coordinate System Standardization¶
# Standardize court coordinates
def standardize_coordinates(df):
    df_std = df.copy()
    # Check if coordinate columns exist
    if 'LOC_X' in df.columns and 'LOC_Y' in df.columns:
        # Rename to lowercase
        df_std.rename(columns={'LOC_X': 'loc_x', 'LOC_Y': 'loc_y'}, inplace=True)
        # Calculate shot distance and angle if not present.
        # NOTE: the check is case-sensitive, so an existing 'SHOT_DISTANCE' column does not
        # prevent a new lowercase 'shot_distance' from being added.
        # NOTE: dividing by 10 assumes coordinates in tenths of a foot. The statistics below
        # (loc_x between -25 and 25) suggest the values may already be in feet, in which
        # case this understates the distance by a factor of 10.
        if 'shot_distance' not in df_std.columns:
            df_std['shot_distance'] = np.sqrt(df_std['loc_x']**2 + df_std['loc_y']**2) / 10  # Convert to feet
        if 'shot_angle' not in df_std.columns:
            df_std['shot_angle'] = np.arctan2(df_std['loc_y'], df_std['loc_x']) * 180 / np.pi  # Convert to degrees
    return df_std

shots_std = standardize_coordinates(shots_std)
# Check results
if 'loc_x' in shots_std.columns:
    print("Coordinate statistics:")
    print(shots_std[['loc_x', 'loc_y', 'shot_distance', 'shot_angle']].describe())
Coordinate statistics:
loc_x loc_y shot_distance shot_angle
count 4.231070e+06 4.231070e+06 4.231070e+06 4.231070e+06
mean 9.484333e-02 1.239039e+01 1.544464e+00 8.981937e+01
std 1.026559e+01 8.554561e+00 9.672471e-01 3.264137e+01
min -2.500000e+01 5.000000e-02 1.500000e-02 2.075925e-01
25% -2.900000e+00 5.875000e+00 6.347277e-01 7.384607e+01
50% -0.000000e+00 8.050000e+00 1.223162e+00 9.000000e+01
75% 2.900000e+00 1.875000e+01 2.402712e+00 1.065430e+02
max 2.500000e+01 9.365000e+01 9.507514e+00 1.797954e+02
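The two derived features are a Cartesian-to-polar conversion measured from the coordinate origin. A minimal single-shot sketch, using only the standard library, is shown below. The function name shot_polar and the units_per_foot parameter are assumptions for illustration, and whether the origin coincides with the basket depends on the data source.

```python
import math


def shot_polar(loc_x, loc_y, units_per_foot=1.0):
    """Return (distance_ft, angle_deg) of a shot location relative to the coordinate origin."""
    distance = math.hypot(loc_x, loc_y) / units_per_foot
    # atan2 measures counterclockwise from the +x axis, so straight ahead (x = 0, y > 0) is 90 degrees
    angle = math.degrees(math.atan2(loc_y, loc_x))
    return distance, angle


print(shot_polar(0.0, 10.0))                           # coordinates already in feet
print(shot_polar(-30.0, 40.0, units_per_foot=10.0))    # coordinates in tenths of a foot
```

Making the unit explicit through a parameter avoids hard-coding the division by 10, and the median shot_angle of 90 degrees in the statistics above matches the straight-ahead case.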
Player Name Standardization¶
# Standardize player names
def standardize_player_names(df, name_cols=None):
    df_std = df.copy()
    # Find player name columns if not specified
    if name_cols is None:
        name_cols = [col for col in df.columns if 'player' in col.lower() and 'name' in col.lower()]
    if len(name_cols) > 0:
        print(f"Found {len(name_cols)} player name columns: {name_cols}")
        # Standardize format for each column
        for col in name_cols:
            if col in df.columns:
                # Check if column has values before processing
                non_null_count = df_std[col].notna().sum()
                print(f"Column {col} has {non_null_count} non-null values out of {len(df_std)} records")
                if non_null_count > 0:
                    # First, convert NaN values to empty strings
                    df_std[col] = df_std[col].fillna('')
                    # Convert all values to strings before applying string methods
                    df_std[col] = df_std[col].astype(str)
                    # Strip whitespace and convert to UPPERCASE (NBA data cleaner approach)
                    df_std[col] = df_std[col].str.strip().str.upper()
                    # Convert empty strings back to NaN
                    df_std[col] = df_std[col].replace('', pd.NA).infer_objects(copy=False)
                    # Show sample of standardized names
                    unique_names = df_std[col].dropna().unique()
                    sample_size = min(5, len(unique_names))
                    if sample_size > 0:
                        print(f"Sample standardized names: {list(unique_names[:sample_size])}")
                    print(f"After standardization, column {col} has {df_std[col].nunique()} unique values")
    else:
        print("No player name columns found")
    return df_std

# Standardize player names
shots_std = standardize_player_names(shots_std)
player_data_std = standardize_player_names(player_data_std, ['player'])  # Add player data with 'player' column
# Check results
for df_name, df, col_name in [('shots', shots_std, 'PLAYER_NAME'), ('player_data', player_data_std, 'player')]:
    if col_name in df.columns:
        print(f"{df_name}.{col_name}: {df[col_name].nunique()} unique values")
Found 1 player name columns: ['PLAYER_NAME']
Column PLAYER_NAME has 4231070 non-null values out of 4231070 records
Sample standardized names: ['ANTHONY MORROW', 'KELENNA AZUBUIKE', 'GRANT HILL', 'DANIEL GIBSON', 'C.J. WATSON']
After standardization, column PLAYER_NAME has 2164 unique values
Found 1 player name columns: ['player']
Column player has 29576 non-null values out of 29576 records
Sample standardized names: ['A.J. GREEN', 'A.J. LAWSON', 'AJ JOHNSON', 'AARON GORDON', 'AARON HOLIDAY']
After standardization, column player has 5252 unique values
shots.PLAYER_NAME: 2164 unique values
player_data.player: 5252 unique values
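The sample names above include both 'A.J. GREEN' and 'AJ JOHNSON', so punctuation conventions vary even after uppercasing. If the two datasets need to be joined on player name, a stricter key may help. The sketch below is one possible approach and goes beyond what the notebook does: name_key is a hypothetical helper that also removes periods, folds accents, and collapses repeated spaces.

```python
import unicodedata


def name_key(name):
    """Build a join key: accent-folded, upper-case, without periods, single-spaced."""
    if name is None:
        return None
    # Decompose accented characters, then drop the combining marks
    text = unicodedata.normalize('NFKD', str(name))
    text = ''.join(ch for ch in text if not unicodedata.combining(ch))
    text = text.replace('.', '').upper()
    # Collapse runs of whitespace; an empty result becomes None
    return ' '.join(text.split()) or None


print(name_key(' A.J.  Green '))
print(name_key('Nikola Jokić'))
```

A key like this is best stored in a separate column so that the displayed names stay untouched; distinct players whose names collapse to the same key would still need a player ID to tell apart.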
Temporal Standardization¶
def handle_dates(df, date_cols=None):
    df_dates = df.copy()
    # Find date columns if not specified
    if date_cols is None:
        date_cols = [col for col in df.columns if 'date' in col.lower()]
    if len(date_cols) > 0:
        print(f"Found {len(date_cols)} date columns: {date_cols}")
        # Convert each date column to datetime
        for col in date_cols:
            if col in df.columns:
                try:
                    # Store original values to check conversion success
                    original_values = df_dates[col].copy()
                    # Convert to datetime with coercion for invalid dates
                    df_dates[col] = pd.to_datetime(df_dates[col], errors='coerce')
                    # Check conversion success
                    success_count = df_dates[col].notna().sum()
                    total_count = len(df_dates[col])
                    success_rate = success_count / total_count * 100 if total_count > 0 else 0
                    print(f"Column {col}: Converted {success_count}/{total_count} values to datetime ({success_rate:.1f}%)")
                    # Show sample of before and after for verification
                    if success_count > 0:
                        # first_valid_index() returns an index label, so look it up with .loc
                        sample_idx = df_dates[col].first_valid_index()
                        if sample_idx is not None:
                            original_sample = original_values.loc[sample_idx]
                            converted_sample = df_dates[col].loc[sample_idx]
                            print(f" Sample conversion: '{original_sample}' → {converted_sample}")
                except Exception as e:
                    print(f"Error converting {col} to datetime: {str(e)}")
    else:
        print("No date columns found")
    # Add season column if not present.
    # NOTE: the recorded output below contains no "Added 'season'" message, so this block
    # does not appear to have run for the shot data in the original notebook.
    if 'season' not in df_dates.columns and 'GAME_DATE' in df_dates.columns:
        # Extract season (assuming season starts in October and ends in June)
        def get_season(date):
            if pd.isna(date):
                return None
            year = date.year
            month = date.month
            if month >= 10:  # October to December
                return f"{year}-{year+1}"
            else:  # January to June
                return f"{year-1}-{year}"
        df_dates['season'] = df_dates['GAME_DATE'].apply(get_season)
        print("Added 'season' column based on GAME_DATE")
        # Show season distribution
        season_counts = df_dates['season'].value_counts()
        print("Season distribution:")
        print(season_counts.head())
    return df_dates

shots_std = handle_dates(shots_std)
team_stats_std = handle_dates(team_stats_std)
player_data_std = handle_dates(player_data_std)  # Add player data handling
# Check results
for df_name, df in [('shots', shots_std), ('team_stats', team_stats_std), ('player_data', player_data_std)]:
    date_cols = [col for col in df.columns if pd.api.types.is_datetime64_any_dtype(df[col])]
    if len(date_cols) > 0:
        print(f"\n{df_name} date columns: {date_cols}")
        for col in date_cols:
            print(f"{df_name}.{col} range: {df[col].min()} to {df[col].max()}")
Found 1 date columns: ['GAME_DATE']
Column GAME_DATE: Converted 4231070/4231070 values to datetime (100.0%)
 Sample conversion: '04-15-2009' → 2009-04-15 00:00:00
No date columns found
No date columns found

shots date columns: ['GAME_DATE']
shots.GAME_DATE range: 2003-10-28 00:00:00 to 2024-04-14 00:00:00
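The season rule inside handle_dates labels a game by the year in which its season started, using October as the cut-off month. A standalone sketch of that rule, using only the standard library, is shown below. The function name season_label and the start_month parameter are assumptions for illustration.

```python
from datetime import date


def season_label(game_date, start_month=10):
    """Label a game date with its season, e.g. a date in April 2009 belongs to '2008-2009'."""
    if game_date is None:
        return None
    # Games from start_month onward open a new season; earlier months belong to the previous one
    start_year = game_date.year if game_date.month >= start_month else game_date.year - 1
    return f"{start_year}-{start_year + 1}"


print(season_label(date(2009, 4, 15)))
print(season_label(date(2003, 10, 28)))
```

A fixed month cut-off is a simplification. The 2019-20 playoffs ran into October 2020, so if the data included those games they would be labelled as 2020-2021 under this rule.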
Update Column Names¶
if 'PLAYER_NAME' in shots_std.columns and 'player_name' not in shots_std.columns:
    shots_std.rename(columns={'PLAYER_NAME': 'player_name'}, inplace=True)
    print("Renamed 'PLAYER_NAME' to 'player_name' in shots data")
if 'TEAM_NAME' in shots_std.columns and 'team_name' not in shots_std.columns:
    shots_std.rename(columns={'TEAM_NAME': 'team_name'}, inplace=True)
    print("Renamed 'TEAM_NAME' to 'team_name' in shots data")
player_column_mapping = {
    'pts': 'points',
    'fga': 'field_goal_attempts',
    'fta': 'free_throw_attempts',
    'tov': 'turnovers',
    'mp': 'minutes'
}
for old_name, new_name in player_column_mapping.items():
    if old_name in player_data_std.columns and new_name not in player_data_std.columns:
        player_data_std.rename(columns={old_name: new_name}, inplace=True)
        print(f"Renamed '{old_name}' to '{new_name}' in player data")
Renamed 'PLAYER_NAME' to 'player_name' in shots data
Renamed 'TEAM_NAME' to 'team_name' in shots data
Renamed 'pts' to 'points' in player data
Renamed 'fga' to 'field_goal_attempts' in player data
Renamed 'fta' to 'free_throw_attempts' in player data
Renamed 'tov' to 'turnovers' in player data
Renamed 'mp' to 'minutes' in player data
Save Standardized Data¶
shots_std.to_csv(processed_dir / 'standardized_shots.csv', index=False)
team_stats_std.to_csv(processed_dir / 'standardized_team.csv', index=False)
player_data_std.to_csv(processed_dir / 'standardized_player.csv', index=False)
print(f"Saved standardized shot data to {processed_dir / 'standardized_shots.csv'}")
print(f"Saved standardized team stats to {processed_dir / 'standardized_team.csv'}")
print(f"Saved standardized player data to {processed_dir / 'standardized_player.csv'}")
Saved standardized shot data to ../data/processed/standardized_shots.csv
Saved standardized team stats to ../data/processed/standardized_team.csv
Saved standardized player data to ../data/processed/standardized_player.csv
Data Standardization Summary¶
We've successfully standardized the following aspects of our data:
Player Names: We populated missing player names from player_id and standardized to a consistent UPPERCASE format. This standardization enables reliable player matching and analysis across datasets, reducing the risk of treating the same player as different entities due to name format variations.
Team Names: We unified team names across all datasets, handling variations and historical changes. Our comprehensive mapping dictionary accounts for city names, nicknames, abbreviations, and historical franchise relocations. This standardization reduces the number of unique team identifiers from hundreds to the actual 30 NBA teams.
Conference Mappings: We added Eastern and Western conference designations to each team, enabling conference-based analysis and comparisons. This addition allows us to explore conference-specific patterns and trends in shooting behavior.
Percentage Values: We converted percentage strings (e.g., "45.6%") to decimal values (0.456) for consistent numerical analysis. This standardization ensures that percentage values can be properly used in mathematical operations and comparisons.
Numeric Columns: We identified and converted potential numeric columns, handling missing values appropriately. This standardization ensures that all numeric data is properly formatted for statistical analysis and modeling.
Coordinates: We standardized court coordinates and calculated derived spatial features like shot distance and angle. These standardized spatial features will be crucial for our spatial analysis of shooting patterns.
Temporal Information: We standardized dates across all datasets, added season information, and ensured proper datetime formatting for time-based analysis. This standardization enables us to track changes over time and perform seasonal comparisons.
These comprehensive standardization steps ensure that our data is consistent, clean, and properly formatted across all datasets, enabling accurate analysis and modeling in subsequent notebooks. The standardization process follows best practices from the NBA data cleaner class, ensuring compatibility with other components of the project.
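As an illustration of the percentage step described above, here is a minimal sketch of converting strings such as "45.6%" to decimals. The helper name `percent_to_decimal` and the sample values are assumptions made for this example; they are not taken from the project's cleaner class.

```python
import pandas as pd

def percent_to_decimal(series: pd.Series) -> pd.Series:
    """Convert strings such as '45.6%' to floats such as 0.456 (hypothetical helper)."""
    cleaned = series.astype(str).str.strip().str.rstrip('%')
    # errors='coerce' turns unparseable entries into NaN instead of raising
    return pd.to_numeric(cleaned, errors='coerce') / 100

example = pd.Series(['45.6%', ' 38% ', 'n/a'])
converted = percent_to_decimal(example)
print(converted.tolist())
```

Coercing bad entries to NaN keeps the column numeric, so missing values can be handled in one place afterwards.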
Next, we'll complete feature engineering to create additional meaningful features that will enhance our predictive models.
DeepShot: Feature Engineering¶
Introduction¶
Feature engineering is a critical step in our NBA shot prediction project. While our raw data contains valuable information, transforming and combining this data into meaningful features can significantly improve our model's predictive power.
In this notebook, we create several categories of features:
Spatial Features: Derived from court coordinates, these features capture the geometric aspects of shooting, including distance from basket, angle, and court zones. Spatial features are expected to be among the most important predictors of shot success.
Game Context Features: These features capture the situational aspects of each shot, including time remaining, quarter, score margin, and "clutch" situations. Game context provides important information about the pressure and strategic considerations for each shot.
Historical Shot Features: These features incorporate a player's past shooting performance from similar locations, providing a baseline expectation for shot success based on historical patterns.
Player Performance Features: These features capture player-specific metrics like true shooting percentage, usage rate, and career stage, helping our model understand individual player tendencies.
Team Features: These features describe team characteristics like win percentage, offensive/defensive ratings, and playing style, providing context about the team environment for each shot.
By engineering these features, we aim to provide our models with rich, meaningful information that captures the multidimensional nature of basketball shooting. We expect spatial features and player-specific features to be particularly important, based on basketball domain knowledge and our exploratory data analysis.
import pandas as pd
import numpy as np
from pathlib import Path
# Setup directories
processed_dir = Path('../data/processed')
features_dir = processed_dir / 'features'
for directory in [processed_dir, features_dir]:
    directory.mkdir(parents=True, exist_ok=True)
# Load data
shots = pd.read_csv(processed_dir / 'standardized_shots.csv')
player = pd.read_csv(processed_dir / 'standardized_player.csv')
team = pd.read_csv(processed_dir / 'standardized_team.csv')
# Handle column name variations
column_mappings = {
'PLAYER_NAME': 'player_name',
'TEAM_NAME': 'team_name',
'MINUTES_LEFT': 'mins_left',
'SECONDS_LEFT': 'secs_left',
'PERIOD': 'quarter',
'QUARTER': 'quarter',
'MARGIN': 'score_margin',
'SHOT_MADE_FLAG': 'shot_made',
'SHOT_MADE': 'shot_made'
}
# Apply mappings if needed
for old_col, new_col in column_mappings.items():
    if old_col in shots.columns and new_col not in shots.columns:
        shots.rename(columns={old_col: new_col}, inplace=True)
Spatial Feature Engineering¶
Spatial features capture the geometric aspects of shooting. The location on the court is one of the most important factors in predicting shot success, as shooting percentage generally decreases with distance from the basket, with some exceptions based on angle and specific zones.
# 1. Spatial Features
shots['shot_distance'] = np.sqrt(shots['loc_x']**2 + shots['loc_y']**2)
shots['shot_angle'] = np.arctan2(shots['loc_x'], shots['loc_y']) * 180 / np.pi
# Court zones
conditions = [
(shots['shot_distance'] < 4),
(shots['shot_distance'] < 8) & (shots['shot_distance'] >= 4),
(shots['shot_distance'] < 16) & (shots['shot_distance'] >= 8),
(shots['shot_distance'] < 23.75) & (shots['shot_distance'] >= 16),
(shots['shot_distance'] >= 23.75)
]
zones = ['Restricted Area', 'Paint', 'Mid-Range', 'Long Mid-Range', 'Three-Point']
shots['court_zone'] = np.select(conditions, zones, default='Unknown')
shots['corner_three'] = ((shots['court_zone'] == 'Three-Point') & (abs(shots['shot_angle']) > 45)).astype(int)
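To make the zone rules concrete, the sketch below applies the same distance, angle, and zone logic to four hand-picked coordinates. The coordinates are invented for illustration and assume feet with the hoop at the origin. With `np.arctan2(x, y)`, an angle of 0 degrees points straight out from the basket and angles near ±90 degrees run along the baseline.

```python
import numpy as np
import pandas as pd

# Toy coordinates (assumed: feet, hoop at the origin)
demo = pd.DataFrame({'loc_x': [0.0, 23.0, 0.0, -24.0],
                     'loc_y': [2.0, 3.0, 25.0, 5.0]})
demo['shot_distance'] = np.sqrt(demo['loc_x']**2 + demo['loc_y']**2)
demo['shot_angle'] = np.degrees(np.arctan2(demo['loc_x'], demo['loc_y']))

# Same distance thresholds as the cell above
d = demo['shot_distance']
conditions = [d < 4, (d >= 4) & (d < 8), (d >= 8) & (d < 16),
              (d >= 16) & (d < 23.75), d >= 23.75]
zones = ['Restricted Area', 'Paint', 'Mid-Range', 'Long Mid-Range', 'Three-Point']
demo['court_zone'] = np.select(conditions, zones, default='Unknown')
demo['corner_three'] = ((demo['court_zone'] == 'Three-Point') &
                        (demo['shot_angle'].abs() > 45)).astype(int)
print(demo)
```

The second row, at (23, 3), is about 23.2 ft from the origin and so falls in 'Long Mid-Range' under a distance-only rule, even though the NBA corner three-point line is 22 ft from the basket. A purely radial zone definition will therefore count some corner threes as two-point attempts.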
Game Context Feature Engineering¶
Game context features capture the situational aspects of each shot. Basketball is a dynamic game where time remaining, score differential, and other contextual factors can significantly impact shot selection and success probability.
# 2. Game Context Features
# Handle missing columns
for col in ['mins_left', 'secs_left', 'quarter']:
    if col not in shots.columns and col.upper() in shots.columns:
        shots[col] = shots[col.upper()]
# Calculate time features if possible
if all(col in shots.columns for col in ['mins_left', 'secs_left', 'quarter']):
    shots['time_remaining_seconds'] = shots['mins_left'] * 60 + shots['secs_left']
    shots['period_type'] = np.where(shots['quarter'] <= 4, 'Regulation', 'Overtime')
    shots['end_of_period'] = ((shots['time_remaining_seconds'] < 120) &
                              ((shots['quarter'] == 4) | (shots['period_type'] == 'Overtime'))).astype(int)
# Calculate score situation if possible
if 'score_margin' in shots.columns:
    conditions = [
        (shots['score_margin'] < -15),
        (shots['score_margin'] < -5) & (shots['score_margin'] >= -15),
        (shots['score_margin'] < 0) & (shots['score_margin'] >= -5),
        (shots['score_margin'] == 0),
        (shots['score_margin'] > 0) & (shots['score_margin'] <= 5),
        (shots['score_margin'] > 5) & (shots['score_margin'] <= 15),
        (shots['score_margin'] > 15)
    ]
    values = ['Large Deficit', 'Moderate Deficit', 'Small Deficit', 'Tied',
              'Small Lead', 'Moderate Lead', 'Large Lead']
    shots['score_situation'] = np.select(conditions, values, default='Unknown')
    if 'time_remaining_seconds' in shots.columns:
        shots['clutch_situation'] = ((abs(shots['score_margin']) <= 5) &
                                     (shots['time_remaining_seconds'] < 300) &
                                     ((shots['quarter'] == 4) | (shots['period_type'] == 'Overtime'))).astype(int)
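The clutch rule above can be checked on a few hand-made rows. This is a small sketch, with `add_clutch_flag` as a hypothetical helper name and invented sample values. It treats any quarter number of 4 or higher as the fourth quarter or overtime, which matches the condition in the cell above.

```python
import pandas as pd

def add_clutch_flag(df: pd.DataFrame) -> pd.DataFrame:
    """Flag shots with the score within 5 points and under 5 minutes left
    in the fourth quarter or any overtime period (hypothetical helper)."""
    out = df.copy()
    out['time_remaining_seconds'] = out['mins_left'] * 60 + out['secs_left']
    late = (out['quarter'] >= 4) & (out['time_remaining_seconds'] < 300)
    close = out['score_margin'].abs() <= 5
    out['clutch_situation'] = (late & close).astype(int)
    return out

demo = pd.DataFrame({
    'quarter':      [4, 2, 5, 4, 4],
    'mins_left':    [2, 1, 4, 5, 1],
    'secs_left':    [30, 0, 59, 0, 0],
    'score_margin': [-3, 0, 5, 2, 12],
})
flags = add_clutch_flag(demo)['clutch_situation'].tolist()
print(flags)
```

Note the strict inequality on time: a shot with exactly 5:00 on the clock (300 seconds) is not flagged, while one at 4:59 is.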
Historical Shot Feature Engineering¶
Historical shot features incorporate a player's past shooting performance. A player's previous success from a particular zone is often predictive of future success, providing valuable baseline information for our models.
# 3. Historical Shot Features
# Handle missing columns
for col in ['player_name', 'court_zone', 'season', 'shot_made']:
    if col not in shots.columns and col.upper() in shots.columns:
        shots[col] = shots[col.upper()]
# Calculate shooting percentages
player_zone_season = shots.groupby(['player_name', 'court_zone', 'season']).agg(
shots=('shot_made', 'count'),
makes=('shot_made', 'sum')
).reset_index()
player_zone_season['shooting_pct'] = player_zone_season['makes'] / player_zone_season['shots']
player_zone_season['shooting_pct'] = player_zone_season['shooting_pct'].fillna(0.5)
player_zone_season['prior_season'] = player_zone_season['season'] + 1
# Merge to get prior season stats
shots_with_prior = shots.merge(
player_zone_season[['player_name', 'court_zone', 'prior_season', 'shooting_pct']],
left_on=['player_name', 'court_zone', 'season'],
right_on=['player_name', 'court_zone', 'prior_season'],
how='left',
suffixes=('', '_prior')
)
# Add prior_pct column
if 'shooting_pct_prior' in shots_with_prior.columns:
    shots_with_prior.rename(columns={'shooting_pct_prior': 'prior_pct'}, inplace=True)
    shots_with_prior['prior_pct'] = shots_with_prior['prior_pct'].fillna(0.5)
else:
    shots_with_prior['prior_pct'] = 0.5
shots = shots_with_prior
if 'prior_season' in shots.columns:
    shots.drop('prior_season', axis=1, inplace=True)
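One detail of the merge above is worth checking: pandas applies `suffixes` only to column names that exist in both frames. If `shots` has no `shooting_pct` column before the merge, the merged column keeps the name `shooting_pct`, `shooting_pct_prior` is never created, and the `else` branch sets `prior_pct` to a constant 0.5. The summary statistics in the exploration notebook, where `prior_pct` has a standard deviation of 0, appear consistent with this. The sketch below shows one possible variant that names the column before merging. The toy data and the `has_prior` flag are assumptions made for illustration.

```python
import pandas as pd

# Toy shot log: one player, one zone, two seasons (invented values)
shots_demo = pd.DataFrame({
    'player_name': ['A'] * 6,
    'court_zone': ['Paint'] * 6,
    'season': [2020, 2020, 2020, 2020, 2021, 2021],
    'shot_made': [1, 1, 1, 0, 1, 0],
})

# Per-season percentage, named prior_pct up front so no suffix is needed
zone_stats = (shots_demo
              .groupby(['player_name', 'court_zone', 'season'])['shot_made']
              .mean()
              .rename('prior_pct')
              .reset_index())
# Shift the key forward so season N stats attach to season N+1 shots
zone_stats['season'] = zone_stats['season'] + 1

merged = shots_demo.merge(zone_stats, on=['player_name', 'court_zone', 'season'], how='left')
# Record whether a real prior exists before filling with the 0.5 default
merged['has_prior'] = merged['prior_pct'].notna().astype(int)
merged['prior_pct'] = merged['prior_pct'].fillna(0.5)
print(merged)
```

Keeping a `has_prior` indicator lets a model distinguish a genuine 50% prior from the fallback value.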
Player Performance Feature Engineering¶
Player performance features capture individual player characteristics. Different players have different shooting abilities, tendencies, and roles, which significantly impact shot success probability beyond what court location alone would predict.
# 4. Player Performance Features
# Map column names
player_column_mapping = {
'pts': 'points',
'fga': 'field_goal_attempts',
'fta': 'free_throw_attempts',
'tov': 'turnovers',
'mp': 'minutes'
}
for old_name, new_name in player_column_mapping.items():
    if old_name in player.columns and new_name not in player.columns:
        player.rename(columns={old_name: new_name}, inplace=True)
# Add default values for missing columns
for col in ['points', 'field_goal_attempts', 'free_throw_attempts', 'turnovers', 'minutes']:
    if col not in player.columns:
        player[col] = 0
# Calculate features
player['true_shooting'] = player['points'] / (2 * (player['field_goal_attempts'] + 0.44 * player['free_throw_attempts']))
player['true_shooting'] = player['true_shooting'].replace([np.inf, -np.inf], np.nan).fillna(0)
player['usage_rate'] = (player['field_goal_attempts'] + 0.44 * player['free_throw_attempts'] + player['turnovers']) / player['minutes']
player['usage_rate'] = player['usage_rate'].replace([np.inf, -np.inf], np.nan).fillna(0)
# Calculate experience if possible
if 'player' in player.columns and 'season' in player.columns:
    player_first_season = player.groupby('player')['season'].min().reset_index()
    player_first_season.rename(columns={'season': 'first_season'}, inplace=True)
    player = player.merge(player_first_season, on='player', how='left')
    player['experience'] = player['season'] - player['first_season']
    # Create experience bins
    bins = [-1, 2, 5, 9, 100]
    labels = ['Rookie (0-2)', 'Early Career (3-5)', 'Prime (6-9)', 'Veteran (10+)']
    player['career_stage'] = pd.cut(player['experience'], bins=bins, labels=labels, right=True)
else:
    player['experience'] = 0
    player['career_stage'] = 'Unknown'
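A worked example may help with the two formulas above. True shooting is points divided by twice the estimated scoring attempts, FGA + 0.44 × FTA. The `usage_rate` computed in the cell above is a per-minute proxy built from the same pieces and is not the standard possession-based usage percentage. The season totals below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented season totals; the second row has no playing time at all
demo = pd.DataFrame({
    'points': [2000, 0],
    'field_goal_attempts': [1500, 0],
    'free_throw_attempts': [500, 0],
    'turnovers': [200, 0],
    'minutes': [2500, 0],
})

scoring_attempts = demo['field_goal_attempts'] + 0.44 * demo['free_throw_attempts']
# 2000 / (2 * (1500 + 220)) for the first row
demo['true_shooting'] = (demo['points'] / (2 * scoring_attempts)
                         ).replace([np.inf, -np.inf], np.nan).fillna(0)
# (1500 + 220 + 200) / 2500 for the first row
demo['usage_rate'] = ((scoring_attempts + demo['turnovers']) / demo['minutes']
                      ).replace([np.inf, -np.inf], np.nan).fillna(0)
print(demo[['true_shooting', 'usage_rate']])
```

The second row shows the guard in action: 0/0 yields NaN, which is then filled with 0, so players without attempts or minutes do not propagate NaN or infinity into later merges.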
Team Feature Engineering¶
Team features describe the characteristics and performance of each team. Team playing style, offensive efficiency, and overall quality provide important context for understanding shot patterns and success rates.
# 5. Team Features
# Map column names
team_column_mapping = {
'win': 'wins',
'loss': 'losses',
'pts_per_game': 'points_per_game',
'pts_against_per_game': 'points_allowed_per_game',
'fg3a': 'three_point_attempts',
'fga': 'field_goal_attempts'
}
for old_name, new_name in team_column_mapping.items():
    if old_name in team.columns and new_name not in team.columns:
        team.rename(columns={old_name: new_name}, inplace=True)
# Add default values for missing columns
for col in ['wins', 'losses', 'points_per_game', 'points_allowed_per_game', 'pace', 'three_point_attempts', 'field_goal_attempts']:
    if col not in team.columns:
        team[col] = 0 if col != 'pace' else 100
# Calculate features
team['win_pct'] = team['wins'] / (team['wins'] + team['losses'])
team['win_pct'] = team['win_pct'].replace([np.inf, -np.inf], np.nan).fillna(0.5)
team['offensive_rating'] = team['points_per_game'] * (100 / team['pace'])
team['defensive_rating'] = team['points_allowed_per_game'] * (100 / team['pace'])
team['net_rating'] = team['offensive_rating'] - team['defensive_rating']
team['three_point_rate'] = team['three_point_attempts'] / team['field_goal_attempts']
team['three_point_rate'] = team['three_point_rate'].replace([np.inf, -np.inf], np.nan).fillna(0.25)
# Categorize playing style
pace_median = team['pace'].median()
team['pace_style'] = np.where(team['pace'] > pace_median, 'Fast', 'Slow')
three_pt_median = team['three_point_rate'].median()
team['shooting_style'] = np.where(team['three_point_rate'] > three_pt_median, 'Three-Heavy', 'Inside')
team['playing_style'] = team['pace_style'] + '-' + team['shooting_style']
Feature Integration¶
Now that we've created features from multiple sources, we need to integrate them into a comprehensive dataset that our models can use. This integration process requires careful handling of join conditions and potential missing values.
# 6. Merge Features
# Prepare columns for merging
player_cols = [col for col in ['player', 'season', 'true_shooting', 'usage_rate', 'experience', 'career_stage']
if col in player.columns]
team_cols = [col for col in ['team', 'season', 'win_pct', 'offensive_rating', 'defensive_rating', 'playing_style']
if col in team.columns]
# Merge player features
if 'player_name' in shots.columns and 'player' in player.columns and 'season' in player.columns:
    shots_with_player = shots.merge(
        player[player_cols],
        left_on=['player_name', 'season'],
        right_on=['player', 'season'],
        how='left'
    )
    if 'player' in shots_with_player.columns:
        shots_with_player.drop('player', axis=1, inplace=True)
else:
    shots_with_player = shots
# Merge team features
if 'team_name' in shots_with_player.columns and 'team' in team.columns and 'season' in team.columns:
    final_shots = shots_with_player.merge(
        team[team_cols],
        left_on=['team_name', 'season'],
        right_on=['team', 'season'],
        how='left'
    )
    if 'team' in final_shots.columns:
        final_shots.drop('team', axis=1, inplace=True)
else:
    final_shots = shots_with_player
# 7. Save Features
final_shots.to_csv(features_dir / 'shots_with_features.csv', index=False)
player.to_csv(features_dir / 'player_features.csv', index=False)
team.to_csv(features_dir / 'team_features.csv', index=False)
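Left joins like the ones above can silently leave rows unmatched or, if the lookup table has duplicate keys, multiply rows. A minimal sketch of two built-in pandas checks follows: `validate='many_to_one'` raises if the right-hand keys are not unique, and `indicator=True` adds a `_merge` column from which a match rate can be computed. The tiny frames are invented for illustration.

```python
import pandas as pd

shots_demo = pd.DataFrame({
    'player_name': ['GRANT HILL', 'GRANT HILL', 'UNKNOWN PLAYER'],
    'season': [2009, 2009, 2009],
})
player_demo = pd.DataFrame({
    'player': ['GRANT HILL'],
    'season': [2009],
    'true_shooting': [0.58],
})

merged = shots_demo.merge(
    player_demo,
    left_on=['player_name', 'season'],
    right_on=['player', 'season'],
    how='left',
    validate='many_to_one',  # raises MergeError if (player, season) repeats on the right
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)
match_rate = (merged['_merge'] == 'both').mean()
print(f"Matched {match_rate:.1%} of shots to a player record")
```

A check of this kind would surface gaps such as the 297,660 shots that the exploration notebook later reports as missing player features.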
Feature Engineering Summary¶
In this notebook, we've created a rich set of features that capture the multidimensional nature of basketball shooting:
Spatial Features: We've transformed raw court coordinates into meaningful spatial features including shot distance, angle, court zones, and corner three indicators. These features capture the geometric aspects of shooting and are expected to be strong predictors of shot success.
Game Context Features: We've created features that capture the situational context of each shot, including time remaining, period type, end-of-period indicators, score situation, and clutch indicators. These features help our models understand how game situation affects shooting.
Historical Shot Features: We've incorporated each player's historical shooting percentage from different court zones, providing a baseline expectation for shot success based on past performance.
Player Performance Features: We've calculated advanced metrics like true shooting percentage and usage rate, and created career stage indicators based on experience. These features help our models understand player-specific tendencies.
Team Features: We've included team performance metrics like win percentage, offensive/defensive ratings, and playing style indicators. These features provide context about the team environment for each shot.
By engineering these diverse features, we've transformed our raw data into a feature-rich dataset that captures the complex factors influencing basketball shooting. This comprehensive feature set will enable our models to make more accurate predictions and generate more meaningful insights.
DeepShot: Data Exploration¶
Introduction¶
Exploratory Data Analysis (EDA) is a crucial step in our NBA shot prediction project. In this notebook, we'll explore the features we've created to understand their distributions, relationships, and potential predictive power.
Our exploration will focus on several key areas:
- Spatial Patterns: How does court location affect shooting efficiency?
- Temporal Patterns: How do time-related factors influence shot success?
- Player Analysis: What player characteristics impact shooting performance?
- Game Context Analysis: How do situational factors affect shooting?
- Feature Importance: Which features have the strongest relationship with shot success?
Through this exploration, we aim to gain insights that will inform our modeling approach and help us understand the factors that influence NBA shooting. We expect to find that spatial factors (particularly shot distance) have the strongest influence on shot success, followed by player-specific factors and game context.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import matplotlib.patches as patches
from matplotlib.colors import LinearSegmentedColormap
from scipy import stats
# Set visualization style
sns.set_theme(style='whitegrid')
plt.rcParams['figure.figsize'] = [12, 8]
plt.rcParams['font.size'] = 12
# Create directories
processed_dir = Path('../data/processed')
features_dir = processed_dir / 'features'
for directory in [processed_dir, features_dir]:
    directory.mkdir(parents=True, exist_ok=True)
# Load shot data with features
shots = pd.read_csv(features_dir / 'shots_with_features.csv')
player = pd.read_csv(features_dir / 'player_features.csv')
team = pd.read_csv(features_dir / 'team_features.csv')
print(f"Loaded {len(shots)} shots, {len(player)} player records, {len(team)} team records")
shots.head()
Loaded 4650091 shots, 29576 player records, 1788 team records
| | SEASON_1 | SEASON_2 | TEAM_ID | team_name | PLAYER_ID | player_name | POSITION_GROUP | POSITION | GAME_DATE | GAME_ID | ... | shooting_pct | prior_pct | true_shooting | usage_rate | experience | career_stage | win_pct | offensive_rating | defensive_rating | playing_style |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2009 | 2008-09 | 1610612744 | Golden State Warriors | 201627 | ANTHONY MORROW | G | SG | 2009-04-15 | 20801229 | ... | NaN | 0.5 | 0.588358 | 0.413518 | 0.0 | Rookie (0-2) | 0.5 | 108.6 | 0.0 | Slow-Inside |
| 1 | 2009 | 2008-09 | 1610612744 | Golden State Warriors | 101235 | KELENNA AZUBUIKE | F | SF | 2009-04-15 | 20801229 | ... | 0.564593 | 0.5 | 0.561982 | 0.438215 | 2.0 | Rookie (0-2) | 0.5 | 108.6 | 0.0 | Slow-Inside |
| 2 | 2009 | 2008-09 | 1610612756 | Phoenix Suns | 255 | GRANT HILL | F | SF | 2009-04-15 | 20801229 | ... | 0.688525 | 0.5 | 0.583835 | 0.396548 | 14.0 | Veteran (10+) | 0.5 | 109.4 | 0.0 | Slow-Inside |
| 3 | 2009 | 2008-09 | 1610612739 | Cleveland Cavaliers | 200789 | DANIEL GIBSON | G | PG | 2009-04-15 | 20801219 | ... | 0.471698 | 0.5 | 0.518731 | 0.348657 | 2.0 | Rookie (0-2) | 0.5 | 100.3 | 0.0 | Slow-Inside |
| 4 | 2009 | 2008-09 | 1610612756 | Phoenix Suns | 255 | GRANT HILL | F | SF | 2009-04-15 | 20801229 | ... | 0.426966 | 0.5 | 0.583835 | 0.396548 | 14.0 | Veteran (10+) | 0.5 | 109.4 | 0.0 | Slow-Inside |
5 rows × 47 columns
Data Overview¶
Before diving into specific analyses, let's get a better understanding of our dataset by examining its structure, basic statistics, and quality characteristics. This overview will help us identify any remaining data issues and understand the scope of our analysis.
# Check for missing values
missing_values = shots.isnull().sum()
print("Columns with missing values:")
print(missing_values[missing_values > 0])
# Basic statistics for numeric columns
shots.describe()
Columns with missing values:
shooting_pct     745530
true_shooting    297660
usage_rate       297660
experience       297660
career_stage     297660
dtype: int64
| | SEASON_1 | TEAM_ID | PLAYER_ID | GAME_ID | loc_x | loc_y | SHOT_DISTANCE | quarter | MINS_LEFT | SECS_LEFT | ... | time_remaining_seconds | end_of_period | shooting_pct | prior_pct | true_shooting | usage_rate | experience | win_pct | offensive_rating | defensive_rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4.650091e+06 | 4.650091e+06 | 4.650091e+06 | 4.650091e+06 | 4.650091e+06 | 4.650091e+06 | 4.650091e+06 | 4.650091e+06 | 4.650091e+06 | 4.650091e+06 | ... | 4.650091e+06 | 4.650091e+06 | 3.904561e+06 | 4650091.0 | 4.352431e+06 | 4.352431e+06 | 4.352431e+06 | 4650091.0 | 4.650091e+06 | 4650091.0 |
| mean | 2.014214e+03 | 1.610613e+09 | 4.103645e+05 | 2.132202e+07 | 9.726408e-02 | 1.240319e+01 | 1.260702e+01 | 2.484101e+00 | 5.336104e+00 | 2.876802e+01 | ... | 3.489343e+02 | 4.834787e-02 | 4.590479e-01 | 0.5 | 5.441507e-01 | 4.760403e-01 | 5.567962e+00 | 0.5 | 1.036231e+02 | 0.0 |
| std | 6.092693e+00 | 8.648372e+00 | 6.125469e+05 | 6.092661e+05 | 1.029289e+01 | 8.564057e+00 | 1.013096e+01 | 1.137472e+00 | 3.467863e+00 | 1.745068e+01 | ... | 2.089087e+02 | 2.145003e-01 | 1.290713e-01 | 0.0 | 5.750766e-02 | 1.251382e-01 | 4.612451e+00 | 0.0 | 7.565563e+00 | 0.0 |
| min | 2.004000e+03 | 1.610613e+09 | 1.500000e+01 | 2.030000e+07 | -2.500000e+01 | 5.000000e-02 | 0.000000e+00 | 1.000000e+00 | 0.000000e+00 | 0.000000e+00 | ... | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.5 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.5 | 8.540000e+01 | 0.0 |
| 25% | 2.009000e+03 | 1.610613e+09 | 2.499000e+03 | 2.080052e+07 | -2.900000e+00 | 5.875000e+00 | 2.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1.400000e+01 | ... | 1.670000e+02 | 0.000000e+00 | 3.680556e-01 | 0.5 | 5.152188e-01 | 3.866487e-01 | 2.000000e+00 | 0.5 | 9.760000e+01 | 0.0 |
| 50% | 2.014000e+03 | 1.610613e+09 | 2.015710e+05 | 2.130113e+07 | 0.000000e+00 | 8.050000e+00 | 1.300000e+01 | 2.000000e+00 | 5.000000e+00 | 2.900000e+01 | ... | 3.500000e+02 | 0.000000e+00 | 4.307692e-01 | 0.5 | 5.456056e-01 | 4.656685e-01 | 5.000000e+00 | 0.5 | 1.028000e+02 | 0.0 |
| 75% | 2.019000e+03 | 1.610613e+09 | 2.035070e+05 | 2.180114e+07 | 2.900000e+00 | 1.875000e+01 | 2.300000e+01 | 3.000000e+00 | 8.000000e+00 | 4.400000e+01 | ... | 5.310000e+02 | 0.000000e+00 | 5.562914e-01 | 0.5 | 5.768397e-01 | 5.557068e-01 | 8.000000e+00 | 0.5 | 1.100000e+02 | 0.0 |
| max | 2.024000e+03 | 1.610613e+09 | 1.642013e+06 | 2.230123e+07 | 2.500000e+01 | 9.365000e+01 | 8.900000e+01 | 8.000000e+00 | 1.200000e+01 | 5.900000e+01 | ... | 7.200000e+02 | 1.000000e+00 | 1.000000e+00 | 0.5 | 1.500000e+00 | 1.666667e+00 | 7.000000e+01 | 0.5 | 1.233000e+02 | 0.0 |
8 rows × 26 columns
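In the summary above, `prior_pct`, `win_pct`, and `defensive_rating` each have a standard deviation of 0, which would be expected if the default fill values from feature engineering were applied to every row. Columns like these carry no signal for a model. The following sketch flags them automatically. The helper name `constant_columns` and the sample frame are assumptions made for illustration.

```python
import pandas as pd

def constant_columns(df: pd.DataFrame) -> list:
    """Return numeric columns that take at most one distinct non-null value."""
    numeric = df.select_dtypes(include='number')
    return [col for col in numeric.columns if numeric[col].nunique(dropna=True) <= 1]

demo = pd.DataFrame({
    'prior_pct': [0.5, 0.5, 0.5],
    'win_pct': [0.5, 0.5, 0.5],
    'true_shooting': [0.59, 0.56, 0.58],
})
flagged = constant_columns(demo)
print(flagged)
```

Running a check like this on `shots` before modeling makes it easy to trace each flagged column back to the upstream step that produced it.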
# Distribution of shots by season
season_counts = shots['season'].value_counts().sort_index()
plt.figure(figsize=(14, 6))
ax = sns.barplot(x=season_counts.index, y=season_counts.values)
plt.title('Number of Shots by Season', fontsize=16)
plt.xlabel('Season', fontsize=14)
plt.ylabel('Number of Shots', fontsize=14)
plt.xticks(rotation=45)
# Add value labels on top of bars
for i, v in enumerate(season_counts.values):
    ax.text(i, v + 1000, f"{v:,}", ha='center', fontsize=10)
plt.tight_layout()
plt.show()
# Overall shot success rate
overall_success = shots['shot_made'].mean() * 100
print(f"Overall shot success rate: {overall_success:.2f}%")
# Shot type distribution
if 'SHOT_TYPE' in shots.columns:
    shot_type_counts = shots['SHOT_TYPE'].value_counts()
    plt.figure(figsize=(10, 6))
    plt.pie(shot_type_counts, labels=shot_type_counts.index, autopct='%1.1f%%', startangle=90, explode=[0.05] * len(shot_type_counts))
    plt.title('Distribution of Shot Types', fontsize=16)
    plt.axis('equal')
    plt.show()
Overall shot success rate: 45.61%
Spatial Analysis¶
Court location is expected to be one of the most important factors in predicting shot success. In this section, we'll explore how shot location affects shooting efficiency, visualize shot distribution across the court, and identify high-efficiency shooting zones.
# Shot success rate by court zone
zone_success = shots.groupby('court_zone')['shot_made'].agg(['count', 'mean']).reset_index()
zone_success.columns = ['court_zone', 'shots', 'success_rate']
zone_success['success_rate'] = zone_success['success_rate'] * 100
zone_success = zone_success.sort_values('success_rate', ascending=False)
zone_success
| | court_zone | shots | success_rate |
|---|---|---|---|
| 3 | Restricted Area | 3507 | 63.986313 |
| 2 | Paint | 1747890 | 57.723999 |
| 0 | Long Mid-Range | 805287 | 40.428071 |
| 1 | Mid-Range | 860765 | 39.336869 |
| 4 | Three-Point | 1232642 | 36.152508 |
# Visualize shot success rate by court zone
plt.figure(figsize=(12, 6))
# Assign x to hue (with legend=False) so the palette is applied without a deprecation warning
ax = sns.barplot(x='court_zone', y='success_rate', hue='court_zone', data=zone_success, palette='viridis', legend=False)
plt.title('Shot Success Rate by Court Zone', fontsize=16)
plt.xlabel('Court Zone', fontsize=14)
plt.ylabel('Success Rate (%)', fontsize=14)
plt.ylim(0, 100)
# Add value labels on top of bars
for i, v in enumerate(zone_success['success_rate']):
    ax.text(i, v + 1, f"{v:.1f}%", ha='center', fontsize=10)
plt.tight_layout()
plt.show()
# Shot volume by court zone
zone_volume = zone_success.sort_values('shots', ascending=False)
plt.figure(figsize=(12, 6))
# Assign x to hue (with legend=False) so the palette is applied without a deprecation warning
ax = sns.barplot(x='court_zone', y='shots', hue='court_zone', data=zone_volume, palette='plasma', legend=False)
plt.title('Shot Volume by Court Zone', fontsize=16)
plt.xlabel('Court Zone', fontsize=14)
plt.ylabel('Number of Shots', fontsize=14)
# Add value labels on top of bars
for i, v in enumerate(zone_volume['shots']):
    ax.text(i, v + 1000, f"{v:,}", ha='center', fontsize=10)
plt.tight_layout()
plt.show()
# Function to draw a half court (units: feet; assumes the hoop center is at the origin)
def draw_court(ax=None, color='black', lw=2, outer_lines=False):
    if ax is None:
        ax = plt.gca()
    # Hoop: the rim has an 18-inch diameter
    hoop = patches.Circle((0, 0), radius=9/12, linewidth=lw, color=color, fill=False)
    # Backboard: 6 ft wide, 15 inches behind the hoop center
    backboard = patches.Rectangle((-3, -1.25), 6, 0, linewidth=lw, color=color)
    # The paint: 16 ft wide, 19 ft deep, starting at the baseline (y = -5.25)
    outer_box = patches.Rectangle((-8, -5.25), 16, 19, linewidth=lw, color=color, fill=False)
    # Free throw circle: 6 ft radius, centered on the free throw line
    free_throw = patches.Circle((0, 13.75), radius=6, linewidth=lw, color=color, fill=False)
    # Restricted area: arc of radius 4 ft
    restricted = patches.Arc((0, 0), 8, 8, theta1=0, theta2=180, linewidth=lw, color=color)
    # Three point line: 14 ft corner segments at 22 ft, joined by an arc of radius 23.75 ft
    corner_three_a = patches.Rectangle((-22, -5.25), 0, 14, linewidth=lw, color=color)
    corner_three_b = patches.Rectangle((22, -5.25), 0, 14, linewidth=lw, color=color)
    three_arc = patches.Arc((0, 0), 47.5, 47.5, theta1=22, theta2=158, linewidth=lw, color=color)
    # Center court circle (only the half that lies inside this half court)
    center_outer = patches.Arc((0, 41.75), 12, 12, theta1=180, theta2=0, linewidth=lw, color=color)
    # Add the court elements to the plot
    court_elements = [hoop, backboard, outer_box, free_throw, restricted,
                      corner_three_a, corner_three_b, three_arc, center_outer]
    if outer_lines:
        # Draw the baseline, sidelines, and half court line (50 ft wide, 47 ft deep)
        boundary = patches.Rectangle((-25, -5.25), 50, 47, linewidth=lw, color=color, fill=False)
        court_elements.append(boundary)
    for element in court_elements:
        ax.add_patch(element)
    ax.set_xlim(-25, 25)
    ax.set_ylim(-5.5, 42)
    return ax
# Shot distribution heatmap
plt.figure(figsize=(12, 11))
ax = plt.gca()
# Draw the court
draw_court(ax, color='black', lw=1.5)
# Create a 2D histogram
h = plt.hist2d(shots['loc_x'], shots['loc_y'], bins=50, cmap='viridis', alpha=0.8)
# Add a colorbar
plt.colorbar(h[3], ax=ax, label='Number of Shots')
plt.title('Shot Distribution Heatmap', fontsize=18)
plt.axis('off')
plt.tight_layout()
plt.show()
# Shot success rate by distance
distance_bins = [0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40]
shots['distance_bin'] = pd.cut(shots['shot_distance'], bins=distance_bins)
distance_success = shots.groupby('distance_bin', observed=False)['shot_made'].agg(['count', 'mean']).reset_index()
distance_success.columns = ['distance_bin', 'shots', 'success_rate']
distance_success['success_rate'] = distance_success['success_rate'] * 100
plt.figure(figsize=(14, 6))
ax1 = plt.gca()
ax2 = ax1.twinx()
# Plot success rate
sns.lineplot(x=distance_success.index, y='success_rate', data=distance_success, marker='o', color='blue', linewidth=3, ax=ax1)
ax1.set_ylabel('Success Rate (%)', color='blue', fontsize=14)
ax1.tick_params(axis='y', labelcolor='blue')
ax1.set_ylim(0, 100)
# Plot shot volume
sns.barplot(x=distance_success.index, y='shots', data=distance_success, alpha=0.3, color='gray', ax=ax2)
ax2.set_ylabel('Number of Shots', color='gray', fontsize=14)
ax2.tick_params(axis='y', labelcolor='gray')
# Set x-axis labels
plt.xticks(range(len(distance_success)), [f"{b.left:.0f}-{b.right:.0f} ft" for b in distance_success['distance_bin']], rotation=45)
plt.title('Shot Success Rate and Volume by Distance', fontsize=16)
plt.tight_layout()
plt.show()
# Corner three analysis (restricted to three-point shots)
three_point_shots = shots[shots['court_zone'] == 'Three-Point']
corner_three_analysis = three_point_shots.groupby('corner_three')['shot_made'].agg(['count', 'mean']).reset_index()
corner_three_analysis.columns = ['corner_three', 'shots', 'success_rate']
corner_three_analysis['success_rate'] = corner_three_analysis['success_rate'] * 100
corner_three_analysis['three_type'] = corner_three_analysis['corner_three'].map({0: 'Above Break Three', 1: 'Corner Three'})
plt.figure(figsize=(10, 6))
# Assign x to hue (with legend=False) so the palette is applied without a deprecation warning
ax = sns.barplot(x='three_type', y='success_rate', hue='three_type', data=corner_three_analysis, palette='cool', legend=False)
plt.title('Corner Three vs. Above Break Three Success Rate', fontsize=16)
plt.xlabel('Three-Point Shot Type', fontsize=14)
plt.ylabel('Success Rate (%)', fontsize=14)
plt.ylim(0, 50)
# Add value labels on top of bars
for i, v in enumerate(corner_three_analysis['success_rate']):
    ax.text(i, v + 0.5, f"{v:.1f}%", ha='center', fontsize=12)
plt.tight_layout()
plt.show()
# Expected points per shot by zone
shots['points'] = np.where(shots['court_zone'] == 'Three-Point', 3 * shots['shot_made'], 2 * shots['shot_made'])
zone_points = shots.groupby('court_zone')['points'].mean().reset_index()
zone_points.columns = ['court_zone', 'expected_points']
zone_points = zone_points.sort_values('expected_points', ascending=False)
plt.figure(figsize=(12, 6))
# Assign the x variable to hue (with legend=False) so the palette applies without seaborn's deprecation warning
ax = sns.barplot(x='court_zone', y='expected_points', hue='court_zone', data=zone_points, palette='viridis', legend=False)
plt.title('Expected Points per Shot by Court Zone', fontsize=16)
plt.xlabel('Court Zone', fontsize=14)
plt.ylabel('Expected Points per Shot', fontsize=14)
# Add value labels on top of bars
for i, v in enumerate(zone_points['expected_points']):
    ax.text(i, v + 0.02, f"{v:.2f}", ha='center', fontsize=10)
plt.tight_layout()
plt.show()
Key Findings from Spatial Analysis:¶
- Shot success decreases significantly with distance from the basket
- The restricted area has the highest success rate (~60-65%)
- Three-point shots have lower success rates (~35-40%) but are taken frequently due to their higher point value
- Corner three-pointers are more efficient than other three-point shots
- Mid-range shots (16-24 feet) have relatively low efficiency for their point value
- When considering expected points per shot, three-point shots and shots in the restricted area provide the highest value
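The expected-points comparison in the last finding comes down to make probability times shot value. Here is a minimal sketch of that arithmetic; the function name is ours, and the inputs are midpoints of the approximate ranges quoted in the findings, not values computed from the dataset.

```python
def expected_points(make_probability: float, point_value: int) -> float:
    """Expected points per attempt = P(make) * points awarded for a make."""
    if not 0.0 <= make_probability <= 1.0:
        raise ValueError("make_probability must be between 0 and 1")
    return make_probability * point_value


# Illustrative inputs only: midpoints of the ~60-65% and ~35-40% ranges above.
restricted_area = expected_points(0.625, 2)
three_pointer = expected_points(0.375, 3)

# A two-point shot must be made 1.5x as often as a three to be worth the same.
break_even_two_rate = 0.375 * 3 / 2

print(f"restricted area: {restricted_area:.3f} points per shot")
print(f"three-pointer:   {three_pointer:.3f} points per shot")
print(f"two-point rate needed to match the three: {break_even_two_rate:.1%}")
```

Under these assumed rates, a two-point attempt has to fall more than about 56% of the time to beat the three, which is why mid-range shots grade poorly on a per-shot basis.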
Temporal Analysis¶
Basketball is a dynamic game where timing matters. In this section, we'll explore how time-related factors like quarter, time remaining, and game situation affect shooting performance.
# Shot success by quarter
quarter_success = shots.groupby('quarter')['shot_made'].agg(['count', 'mean']).reset_index()
quarter_success.columns = ['quarter', 'shots', 'success_rate']
quarter_success['success_rate'] = quarter_success['success_rate'] * 100
quarter_success
| | quarter | shots | success_rate |
|---|---|---|---|
| 0 | 1 | 1204859 | 46.540633 |
| 1 | 2 | 1168538 | 45.892731 |
| 2 | 3 | 1134704 | 45.628023 |
| 3 | 4 | 1110001 | 44.420230 |
| 4 | 5 | 27460 | 41.238165 |
| 5 | 6 | 3839 | 40.479291 |
| 6 | 7 | 611 | 39.770867 |
| 7 | 8 | 79 | 44.303797 |
# Visualize shot success by quarter
plt.figure(figsize=(12, 6))
ax1 = plt.gca()
ax2 = ax1.twinx()
# Plot success rate
sns.lineplot(x='quarter', y='success_rate', data=quarter_success, marker='o', color='blue', linewidth=3, ax=ax1)
ax1.set_ylabel('Success Rate (%)', color='blue', fontsize=14)
ax1.tick_params(axis='y', labelcolor='blue')
ax1.set_ylim(30, 50)
# Plot shot volume with ax2.bar so the bars sit at the same numeric quarter positions as the line
# (sns.barplot would place them at categorical positions 0-7 instead of 1-8)
ax2.bar(quarter_success['quarter'], quarter_success['shots'], alpha=0.3, color='gray')
ax2.set_ylabel('Number of Shots', color='gray', fontsize=14)
ax2.tick_params(axis='y', labelcolor='gray')
# Label the primary axes: the twin's x-axis is hidden, so labels set on it would not show
ax1.set_title('Shot Success Rate and Volume by Quarter', fontsize=16)
ax1.set_xlabel('Quarter', fontsize=14)
plt.tight_layout()
plt.show()
# Shot success by time remaining in quarter
if 'time_remaining_seconds' in shots.columns:
    # Create time bins (1 minute intervals)
    time_bins = list(range(0, 721, 60))
    shots['time_bin'] = pd.cut(shots['time_remaining_seconds'], bins=time_bins, right=False)
    # Pass observed explicitly: grouping on a categorical column otherwise triggers a pandas FutureWarning
    time_success = shots.groupby('time_bin', observed=False)['shot_made'].agg(['count', 'mean']).reset_index()
    time_success.columns = ['time_bin', 'shots', 'success_rate']
    time_success['success_rate'] = time_success['success_rate'] * 100
    time_success['minutes_remaining'] = [(b.left / 60) for b in time_success['time_bin']]
    plt.figure(figsize=(14, 6))
    ax1 = plt.gca()
    ax2 = ax1.twinx()
    # Plot success rate
    sns.lineplot(x='minutes_remaining', y='success_rate', data=time_success, marker='o', color='blue', linewidth=2, ax=ax1)
    ax1.set_ylabel('Success Rate (%)', color='blue', fontsize=14)
    ax1.tick_params(axis='y', labelcolor='blue')
    ax1.set_ylim(35, 50)
    # Plot shot volume
    sns.barplot(x='minutes_remaining', y='shots', data=time_success, alpha=0.3, color='gray', ax=ax2)
    ax2.set_ylabel('Number of Shots', color='gray', fontsize=14)
    ax2.tick_params(axis='y', labelcolor='gray')
    ax1.set_title('Shot Success Rate and Volume by Time Remaining in Quarter', fontsize=16)
    ax1.set_xlabel('Minutes Remaining', fontsize=14)
    ax1.tick_params(axis='x', rotation=45)
    plt.tight_layout()
    plt.show()
# Clutch situation analysis
if 'clutch_situation' in shots.columns:
    clutch_analysis = shots.groupby('clutch_situation')['shot_made'].agg(['count', 'mean']).reset_index()
    clutch_analysis.columns = ['clutch_situation', 'shots', 'success_rate']
    clutch_analysis['success_rate'] = clutch_analysis['success_rate'] * 100
    clutch_analysis['situation'] = clutch_analysis['clutch_situation'].map({0: 'Regular', 1: 'Clutch'})
    plt.figure(figsize=(10, 6))
    ax = sns.barplot(x='situation', y='success_rate', hue='situation', data=clutch_analysis, palette='coolwarm', legend=False)
    plt.title('Shot Success Rate: Clutch vs. Regular Situations', fontsize=16)
    plt.xlabel('Game Situation', fontsize=14)
    plt.ylabel('Success Rate (%)', fontsize=14)
    plt.ylim(30, 50)
    # Add value labels on top of bars
    for i, v in enumerate(clutch_analysis['success_rate']):
        ax.text(i, v + 0.5, f"{v:.1f}%", ha='center', fontsize=12)
    plt.tight_layout()
    plt.show()
    # Shot type distribution in clutch vs regular situations
    if 'court_zone' in shots.columns:
        clutch_zone = pd.crosstab(shots['clutch_situation'], shots['court_zone'], normalize='index') * 100
        clutch_zone = clutch_zone.reset_index()
        clutch_zone['situation'] = clutch_zone['clutch_situation'].map({0: 'Regular', 1: 'Clutch'})
        plt.figure(figsize=(14, 6))
        clutch_zone_melted = pd.melt(clutch_zone, id_vars=['clutch_situation', 'situation'], var_name='court_zone', value_name='percentage')
        sns.barplot(x='court_zone', y='percentage', hue='situation', data=clutch_zone_melted, palette='coolwarm')
        plt.title('Shot Distribution by Court Zone: Clutch vs. Regular Situations', fontsize=16)
        plt.xlabel('Court Zone', fontsize=14)
        plt.ylabel('Percentage of Shots (%)', fontsize=14)
        plt.legend(title='Situation')
        plt.xticks(rotation=45)
        plt.tight_layout()
        plt.show()
# End of period analysis
if 'end_of_period' in shots.columns:
    end_period_analysis = shots.groupby('end_of_period')['shot_made'].agg(['count', 'mean']).reset_index()
    end_period_analysis.columns = ['end_of_period', 'shots', 'success_rate']
    end_period_analysis['success_rate'] = end_period_analysis['success_rate'] * 100
    end_period_analysis['situation'] = end_period_analysis['end_of_period'].map({0: 'Regular', 1: 'End of Period'})
    plt.figure(figsize=(10, 6))
    # Assign the x variable to hue (with legend=False) so the palette applies without seaborn's deprecation warning
    ax = sns.barplot(x='situation', y='success_rate', hue='situation', data=end_period_analysis, palette='coolwarm', legend=False)
    plt.title('Shot Success Rate: End of Period vs. Regular Situations', fontsize=16)
    plt.xlabel('Game Situation', fontsize=14)
    plt.ylabel('Success Rate (%)', fontsize=14)
    plt.ylim(30, 50)
    # Add value labels on top of bars
    for i, v in enumerate(end_period_analysis['success_rate']):
        ax.text(i, v + 0.5, f"{v:.1f}%", ha='center', fontsize=12)
    plt.tight_layout()
    plt.show()
Key Findings from Temporal Analysis:¶
- Shot success decreases slightly in the final minutes of quarters
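- Fourth-quarter shots succeed less often than first-quarter shots (44.4% vs. 46.5%), and the first three overtime periods are lower still (roughly 40-41%); the 44.3% figure for the fourth overtime rests on only 79 shots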
- Clutch situations (close games in final minutes) show reduced shooting efficiency
- Teams tend to take more three-point shots in late-game situations when trailing
- End-of-period shots have significantly lower success rates, likely due to rushed or contested attempts
- Shot volume increases in the final minutes of quarters, particularly in the fourth quarter
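The per-period rates in the quarter table rest on very different sample sizes, so it helps to attach an interval to each one before comparing them. The sketch below uses the Wilson score interval, which is our choice of method and not something the notebook computes; the make counts are reconstructed from the shot counts and success rates in the table.

```python
import math


def wilson_interval(makes: int, attempts: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    p_hat = makes / attempts
    denom = 1 + z**2 / attempts
    center = (p_hat + z**2 / (2 * attempts)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / attempts + z**2 / (4 * attempts**2)
    ) / denom
    return center - half_width, center + half_width


# Period 8 in the quarter table: 79 shots at 44.303797%, i.e. 35 makes.
low, high = wilson_interval(35, 79)

# Period 4 for contrast: 1,110,001 shots at 44.420230%.
q4_makes = round(0.44420230 * 1_110_001)
q4_low, q4_high = wilson_interval(q4_makes, 1_110_001)

print(f"period 8: {low:.3f} to {high:.3f}")
print(f"period 4: {q4_low:.4f} to {q4_high:.4f}")
```

The period 8 interval spans roughly 34% to 55%, while the fourth-quarter interval is under 0.2 percentage points wide, so the late-overtime rate should not be read as a rebound in shooting.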
Player Analysis¶
Different players have different shooting abilities and tendencies. In this section, we'll explore how player characteristics and career stage affect shooting performance, and identify patterns in player shooting behavior.
# Top players by shooting efficiency (min 500 shots)
player_efficiency = shots.groupby('player_name').agg(
    shots=('shot_made', 'count'),
    makes=('shot_made', 'sum')
).reset_index()
player_efficiency['efficiency'] = player_efficiency['makes'] / player_efficiency['shots'] * 100
player_efficiency = player_efficiency[player_efficiency['shots'] >= 500].sort_values('efficiency', ascending=False).head(10)
plt.figure(figsize=(14, 6))
# Assign the x variable to hue (with legend=False) so the palette applies without seaborn's deprecation warning
ax = sns.barplot(x='player_name', y='efficiency', hue='player_name', data=player_efficiency, palette='viridis', legend=False)
plt.title('Top 10 Players by Shooting Efficiency (min 500 shots)', fontsize=16)
plt.xlabel('Player', fontsize=14)
plt.ylabel('Shooting Efficiency (%)', fontsize=14)
plt.xticks(rotation=45, ha='right')
plt.ylim(40, 60)
# Add value labels on top of bars
for i, v in enumerate(player_efficiency['efficiency']):
    ax.text(i, v + 0.5, f"{v:.1f}%", ha='center', fontsize=10)
plt.tight_layout()
plt.show()
# Top players by shot volume
# Recompute from the full shot table: player_efficiency was already cut to the ten most
# efficient players, so sorting it by volume would only reorder those same ten names
player_volume = (
    shots.groupby('player_name')['shot_made'].count()
    .rename('shots')
    .reset_index()
    .sort_values('shots', ascending=False)
    .head(10)
)
plt.figure(figsize=(14, 6))
ax = sns.barplot(x='player_name', y='shots', hue='player_name', data=player_volume, palette='plasma', legend=False)
plt.title('Top 10 Players by Shot Volume', fontsize=16)
plt.xlabel('Player', fontsize=14)
plt.ylabel('Number of Shots', fontsize=14)
plt.xticks(rotation=45, ha='right')
# Add value labels on top of bars
for i, v in enumerate(player_volume['shots']):
    ax.text(i, v + 100, f"{v:,}", ha='center', fontsize=10)
plt.tight_layout()
plt.show()
# Career stage analysis
if 'career_stage' in shots.columns:
    career_success = shots.groupby('career_stage')['shot_made'].agg(['count', 'mean']).reset_index()
    career_success.columns = ['career_stage', 'shots', 'success_rate']
    career_success['success_rate'] = career_success['success_rate'] * 100
    # Define the correct order for career stages
    stage_order = ['Rookie (0-2)', 'Early Career (3-5)', 'Prime (6-9)', 'Veteran (10+)']
    career_success['stage_order'] = career_success['career_stage'].map({stage: i for i, stage in enumerate(stage_order)})
    career_success = career_success.sort_values('stage_order')
    plt.figure(figsize=(12, 6))
    ax1 = plt.gca()
    ax2 = ax1.twinx()
    # Plot success rate; sort=False keeps the career-stage order defined above
    sns.lineplot(x='career_stage', y='success_rate', data=career_success, marker='o', color='blue', linewidth=3, sort=False, ax=ax1)
    ax1.set_ylabel('Success Rate (%)', color='blue', fontsize=14)
    ax1.tick_params(axis='y', labelcolor='blue')
    ax1.set_ylim(40, 50)
    # Plot shot volume
    sns.barplot(x='career_stage', y='shots', data=career_success, alpha=0.3, color='gray', ax=ax2)
    ax2.set_ylabel('Number of Shots', color='gray', fontsize=14)
    ax2.tick_params(axis='y', labelcolor='gray')
    ax1.set_title('Shot Success Rate and Volume by Career Stage', fontsize=16)
    ax1.set_xlabel('Career Stage', fontsize=14)
    plt.tight_layout()
    plt.show()
# Player shooting consistency analysis
if 'prior_pct' in shots.columns and 'shooting_pct' in shots.columns:
    # Calculate correlation between prior season and current season shooting percentages
    correlation = shots['prior_pct'].corr(shots['shooting_pct'])
    # Create a scatter plot
    plt.figure(figsize=(10, 8))
    sns.scatterplot(x='prior_pct', y='shooting_pct', data=shots, alpha=0.3)
    plt.title(f'Player Shooting Consistency: Prior vs. Current Season (r = {correlation:.3f})', fontsize=16)
    plt.xlabel('Prior Season Shooting %', fontsize=14)
    plt.ylabel('Current Season Shooting %', fontsize=14)
    # Add a diagonal line representing perfect consistency
    x = np.linspace(0, 1, 100)
    plt.plot(x, x, 'r--', alpha=0.7)
    # Add a regression line
    sns.regplot(x='prior_pct', y='shooting_pct', data=shots, scatter=False, line_kws={'color': 'green'})
    plt.xlim(0, 1)
    plt.ylim(0, 1)
    plt.tight_layout()
    plt.show()
/Users/luke/src/github.com/lukelittle/csca5642-final-project/venv/lib/python3.10/site-packages/numpy/lib/function_base.py:2897: RuntimeWarning: invalid value encountered in divide c /= stddev[:, None] /Users/luke/src/github.com/lukelittle/csca5642-final-project/venv/lib/python3.10/site-packages/numpy/lib/function_base.py:2898: RuntimeWarning: invalid value encountered in divide c /= stddev[None, :]
# Player efficiency by court zone
player_zone_efficiency = shots.groupby(['player_name', 'court_zone']).agg(
    shots=('shot_made', 'count'),
    makes=('shot_made', 'sum')
).reset_index()
player_zone_efficiency['efficiency'] = player_zone_efficiency['makes'] / player_zone_efficiency['shots'] * 100
# Keep the five highest-volume players among those with at least 500 total shots
# (without the sort, the first five names are simply the first alphabetically)
player_zone_counts = player_zone_efficiency.groupby('player_name')['shots'].sum().reset_index()
top_players = (
    player_zone_counts[player_zone_counts['shots'] >= 500]
    .sort_values('shots', ascending=False)['player_name']
    .head(5)
    .tolist()
)
top_player_zones = player_zone_efficiency[player_zone_efficiency['player_name'].isin(top_players)]
plt.figure(figsize=(14, 8))
sns.barplot(x='court_zone', y='efficiency', hue='player_name', data=top_player_zones)
plt.title('Shooting Efficiency by Court Zone for Top Players', fontsize=16)
plt.xlabel('Court Zone', fontsize=14)
plt.ylabel('Shooting Efficiency (%)', fontsize=14)
plt.ylim(0, 100)
plt.legend(title='Player')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Key Findings from Player Analysis:¶
- Players in their prime years (6-9 seasons) tend to have the highest shooting efficiency
- A player's historical shooting percentage in a zone is a strong predictor of future success
- Shooting specialists often have higher efficiency but lower volume than star players
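- Player shooting percentages are expected to carry over from one season to the next, but the consistency cell above emitted a NumPy divide warning (a sign of a zero-variance or near-empty input), so the reported correlation should be re-checked before relying on it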
- Different players show distinct shooting profiles across court zones, reflecting their playing styles and roles
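The consistency check above correlates `prior_pct` and `shooting_pct` across every shot row, which weights each player by shot volume and repeats identical values millions of times. A hedged alternative is to collapse to one row per distinct player record first. The sketch below uses invented toy data and a hypothetical helper name, and assumes the two percentages are constant within each player (or player-season).

```python
import pandas as pd

# Hypothetical toy data: one row per shot, player-level percentages repeated on each row.
toy_shots = pd.DataFrame({
    "player_name": ["A"] * 4 + ["B"] * 2 + ["C"] * 3,
    "prior_pct": [0.48] * 4 + [0.41] * 2 + [0.52] * 3,
    "shooting_pct": [0.47] * 4 + [0.43] * 2 + [0.50] * 3,
})


def player_level_correlation(df: pd.DataFrame) -> float:
    """Correlate prior and current percentages using one row per distinct player record."""
    per_player = (
        df[["player_name", "prior_pct", "shooting_pct"]]
        .dropna()
        .drop_duplicates()
    )
    # A constant column has zero variance, which makes the correlation undefined.
    if per_player["prior_pct"].nunique() < 2 or per_player["shooting_pct"].nunique() < 2:
        return float("nan")
    return per_player["prior_pct"].corr(per_player["shooting_pct"])


r = player_level_correlation(toy_shots)
print(f"player-level r = {r:.3f}")
```

Checking for a constant column before calling `corr` also avoids the divide warning that NumPy raises when a standard deviation is zero.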
Game Context Analysis¶
The situation in which a shot is taken can significantly impact its likelihood of success. In this section, we'll explore how factors like score margin, home/away status, and other situational variables affect shooting performance.
# Shot success by score situation
if 'score_situation' in shots.columns:
    score_success = shots.groupby('score_situation')['shot_made'].agg(['count', 'mean']).reset_index()
    score_success.columns = ['score_situation', 'shots', 'success_rate']
    score_success['success_rate'] = score_success['success_rate'] * 100
    # Define the correct order for score situations
    situation_order = ['Large Deficit', 'Moderate Deficit', 'Small Deficit', 'Tied',
                       'Small Lead', 'Moderate Lead', 'Large Lead']
    score_success['situation_order'] = score_success['score_situation'].map({sit: i for i, sit in enumerate(situation_order)})
    score_success = score_success.sort_values('situation_order')
    plt.figure(figsize=(14, 6))
    ax1 = plt.gca()
    ax2 = ax1.twinx()
    # Plot success rate; sort=False keeps the deficit-to-lead order defined above
    sns.lineplot(x='score_situation', y='success_rate', data=score_success, marker='o', color='blue', linewidth=3, sort=False, ax=ax1)
    ax1.set_ylabel('Success Rate (%)', color='blue', fontsize=14)
    ax1.tick_params(axis='y', labelcolor='blue')
    ax1.set_ylim(40, 50)
    # Plot shot volume
    sns.barplot(x='score_situation', y='shots', data=score_success, alpha=0.3, color='gray', ax=ax2)
    ax2.set_ylabel('Number of Shots', color='gray', fontsize=14)
    ax2.tick_params(axis='y', labelcolor='gray')
    ax1.set_title('Shot Success Rate and Volume by Score Situation', fontsize=16)
    ax1.set_xlabel('Score Situation', fontsize=14)
    ax1.tick_params(axis='x', rotation=45)
    plt.tight_layout()
    plt.show()
# Home vs. Away analysis
if 'HOME_TEAM' in shots.columns and 'AWAY_TEAM' in shots.columns and 'team_name' in shots.columns:
    # Determine if the shot was taken by the home or away team
    shots['is_home'] = shots['team_name'] == shots['HOME_TEAM']
    home_away_success = shots.groupby('is_home')['shot_made'].agg(['count', 'mean']).reset_index()
    home_away_success.columns = ['is_home', 'shots', 'success_rate']
    home_away_success['success_rate'] = home_away_success['success_rate'] * 100
    home_away_success['location'] = home_away_success['is_home'].map({True: 'Home', False: 'Away'})
    plt.figure(figsize=(10, 6))
    # Assign the x variable to hue (with legend=False) so the palette applies without seaborn's deprecation warning
    ax = sns.barplot(x='location', y='success_rate', hue='location', data=home_away_success, palette='Set2', legend=False)
    plt.title('Shot Success Rate: Home vs. Away', fontsize=16)
    plt.xlabel('Team Location', fontsize=14)
    plt.ylabel('Success Rate (%)', fontsize=14)
    plt.ylim(40, 50)
    # Add value labels on top of bars
    for i, v in enumerate(home_away_success['success_rate']):
        ax.text(i, v + 0.2, f"{v:.1f}%", ha='center', fontsize=12)
    plt.tight_layout()
    plt.show()
# Shot type distribution by score situation
if 'score_situation' in shots.columns and 'court_zone' in shots.columns:
    # Create a cross-tabulation of score situation and court zone
    score_zone = pd.crosstab(shots['score_situation'], shots['court_zone'], normalize='index') * 100
    score_zone = score_zone.reset_index()
    # Reorder the score situations
    situation_order = ['Large Deficit', 'Moderate Deficit', 'Small Deficit', 'Tied',
                       'Small Lead', 'Moderate Lead', 'Large Lead']
    score_zone['situation_order'] = score_zone['score_situation'].map({sit: i for i, sit in enumerate(situation_order) if sit in score_zone['score_situation'].values})
    score_zone = score_zone.sort_values('situation_order')
    # Melt the dataframe for easier plotting
    score_zone_melted = pd.melt(score_zone, id_vars=['score_situation', 'situation_order'],
                                value_vars=[col for col in score_zone.columns if col not in ['score_situation', 'situation_order']],
                                var_name='court_zone', value_name='percentage')
    plt.figure(figsize=(16, 8))
    sns.barplot(x='score_situation', y='percentage', hue='court_zone', data=score_zone_melted)
    plt.title('Shot Distribution by Court Zone Across Score Situations', fontsize=16)
    plt.xlabel('Score Situation', fontsize=14)
    plt.ylabel('Percentage of Shots (%)', fontsize=14)
    plt.legend(title='Court Zone')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
# Team performance analysis
if 'win_pct' in shots.columns:
    # Create bins for team winning percentage
    win_pct_bins = [0, 0.3, 0.4, 0.5, 0.6, 0.7, 1.0]
    win_pct_labels = ['<30%', '30-40%', '40-50%', '50-60%', '60-70%', '>70%']
    shots['win_pct_bin'] = pd.cut(shots['win_pct'], bins=win_pct_bins, labels=win_pct_labels)
    # Pass observed explicitly: grouping on a categorical column otherwise triggers a pandas FutureWarning
    team_success = shots.groupby('win_pct_bin', observed=False)['shot_made'].agg(['count', 'mean']).reset_index()
    team_success.columns = ['win_pct_bin', 'shots', 'success_rate']
    team_success['success_rate'] = team_success['success_rate'] * 100
    plt.figure(figsize=(12, 6))
    ax1 = plt.gca()
    ax2 = ax1.twinx()
    # Plot success rate
    sns.lineplot(x='win_pct_bin', y='success_rate', data=team_success, marker='o', color='blue', linewidth=3, ax=ax1)
    ax1.set_ylabel('Success Rate (%)', color='blue', fontsize=14)
    ax1.tick_params(axis='y', labelcolor='blue')
    ax1.set_ylim(40, 50)
    # Plot shot volume
    sns.barplot(x='win_pct_bin', y='shots', data=team_success, alpha=0.3, color='gray', ax=ax2)
    ax2.set_ylabel('Number of Shots', color='gray', fontsize=14)
    ax2.tick_params(axis='y', labelcolor='gray')
    ax1.set_title('Shot Success Rate by Team Winning Percentage', fontsize=16)
    ax1.set_xlabel('Team Winning Percentage', fontsize=14)
    plt.tight_layout()
    plt.show()
Key Findings from Game Context Analysis:¶
- Home teams shoot slightly better than away teams
- Teams shoot better when leading than when trailing
- Shot success decreases in high-pressure situations (close games, late shot clock)
- Teams take more three-point shots when trailing by large margins
- Shot selection becomes more conservative in close games
- Better teams (higher win percentage) generally have higher shooting efficiency
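With millions of shots, even a gap described as "slightly better" will usually be statistically significant, so it is worth reporting the size of the difference alongside any test. The sketch below runs a two-proportion z-test on made-up home and away counts; the numbers are hypothetical placeholders, not results from this dataset, and the function name is ours.

```python
import math


def two_proportion_z(makes_a: int, shots_a: int, makes_b: int, shots_b: int):
    """Return (difference in rates, z statistic, two-sided p-value) for two groups."""
    p_a = makes_a / shots_a
    p_b = makes_b / shots_b
    pooled = (makes_a + makes_b) / (shots_a + shots_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / shots_a + 1 / shots_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_a - p_b, z, p_value


# Hypothetical counts for illustration only: 46.0% at home vs. 45.5% away.
diff, z, p_value = two_proportion_z(460_000, 1_000_000, 455_000, 1_000_000)
print(f"difference = {diff * 100:.2f} percentage points, z = {z:.2f}, p = {p_value:.2e}")
```

In this made-up case a half-point gap is overwhelmingly significant yet small in practical terms, which is the distinction to keep in mind when reading the findings above.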
Feature Importance Analysis¶
Understanding which features have the strongest relationship with shot success will inform our modeling approach. In this section, we'll analyze feature correlations and distributions to identify the most predictive features.
# Correlation of features with shot success
numeric_cols = shots.select_dtypes(include=['number']).columns.tolist()
if 'shot_made' in numeric_cols:
    numeric_cols.remove('shot_made')  # Remove target variable
# Calculate correlation with shot_made
correlations = []
for col in numeric_cols:
    corr = shots[col].corr(shots['shot_made'])
    correlations.append({'feature': col, 'correlation': corr})
corr_df = pd.DataFrame(correlations)
corr_df = corr_df.sort_values('correlation', key=abs, ascending=False).head(15)
corr_df
/Users/luke/src/github.com/lukelittle/csca5642-final-project/venv/lib/python3.10/site-packages/numpy/lib/function_base.py:2897: RuntimeWarning: invalid value encountered in divide c /= stddev[:, None] /Users/luke/src/github.com/lukelittle/csca5642-final-project/venv/lib/python3.10/site-packages/numpy/lib/function_base.py:2898: RuntimeWarning: invalid value encountered in divide c /= stddev[None, :]
| | feature | correlation |
|---|---|---|
| 26 | points | 0.970156 |
| 6 | SHOT_DISTANCE | -0.199835 |
| 18 | shooting_pct | 0.189238 |
| 11 | shot_distance | -0.169855 |
| 5 | loc_y | -0.140821 |
| 20 | true_shooting | 0.072523 |
| 13 | corner_three | -0.040546 |
| 24 | offensive_rating | 0.021204 |
| 16 | time_remaining_seconds | 0.017990 |
| 8 | MINS_LEFT | 0.016766 |
| 14 | mins_left | 0.016766 |
| 7 | quarter | -0.015955 |
| 9 | SECS_LEFT | 0.015457 |
| 15 | secs_left | 0.015457 |
| 17 | end_of_period | -0.015407 |
# Visualize feature correlations
plt.figure(figsize=(14, 8))
# Assign the y variable to hue (with legend=False) so the palette applies without seaborn's deprecation warning
ax = sns.barplot(x='correlation', y='feature', hue='feature', data=corr_df, palette='coolwarm', legend=False)
plt.title('Top 15 Features by Correlation with Shot Success', fontsize=16)
plt.xlabel('Correlation Coefficient', fontsize=14)
plt.ylabel('Feature', fontsize=14)
plt.axvline(x=0, color='black', linestyle='--')
# Add value labels
for i, v in enumerate(corr_df['correlation']):
    if v >= 0:
        ax.text(v + 0.01, i, f"{v:.3f}", va='center', fontsize=10)
    else:
        ax.text(v - 0.06, i, f"{v:.3f}", va='center', fontsize=10)
plt.tight_layout()
plt.show()
# Feature distributions for made vs. missed shots
# Skip 'points': it is computed from shot_made, so its distribution only restates the target
top_features = corr_df[corr_df['feature'] != 'points'].head(4)['feature'].tolist()
plt.figure(figsize=(16, 12))
for i, feature in enumerate(top_features):
    plt.subplot(2, 2, i+1)
    sns.histplot(data=shots, x=feature, hue='shot_made', bins=30, alpha=0.6, kde=True)
    plt.title(f'Distribution of {feature} for Made vs. Missed Shots', fontsize=14)
    plt.xlabel(feature, fontsize=12)
    plt.ylabel('Count', fontsize=12)
    plt.legend(title='Shot Made', labels=['Missed', 'Made'])
plt.tight_layout()
plt.show()
# Shot success rate by distance and prior shooting percentage
if 'prior_pct' in shots.columns and 'shot_distance' in shots.columns:
    # Create bins for prior shooting percentage
    prior_pct_bins = [0, 0.3, 0.4, 0.5, 0.6, 1.0]
    prior_pct_labels = ['<30%', '30-40%', '40-50%', '50-60%', '>60%']
    shots['prior_pct_bin'] = pd.cut(shots['prior_pct'], bins=prior_pct_bins, labels=prior_pct_labels)
    # Create distance bins
    distance_bins = [0, 5, 10, 15, 20, 25, 30, 35]
    shots['distance_bin_simple'] = pd.cut(shots['shot_distance'], bins=distance_bins)
    # Calculate success rate by distance and prior percentage
    # (observed is passed explicitly to avoid the pandas FutureWarning for categorical groupers)
    success_by_dist_prior = shots.groupby(['distance_bin_simple', 'prior_pct_bin'], observed=False)['shot_made'].mean().reset_index()
    success_by_dist_prior['success_rate'] = success_by_dist_prior['shot_made'] * 100
    # Pivot for heatmap
    heatmap_data = success_by_dist_prior.pivot(index='distance_bin_simple', columns='prior_pct_bin', values='success_rate')
    plt.figure(figsize=(12, 8))
    sns.heatmap(heatmap_data, annot=True, cmap='viridis', fmt='.1f')
    plt.title('Shot Success Rate (%) by Distance and Prior Shooting Percentage', fontsize=16)
    plt.xlabel('Prior Shooting Percentage', fontsize=14)
    plt.ylabel('Shot Distance (ft)', fontsize=14)
    plt.tight_layout()
    plt.show()
Key Findings from Feature Importance Analysis:¶
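- Shot distance is the strongest legitimate predictor of shot success, although the linear correlation is modest (r ≈ -0.20 for SHOT_DISTANCE); `points` tops the table at r = 0.97 only because it is computed from `shot_made`, so it must be excluded from model inputs
- A player's shooting percentage (`shooting_pct`, r ≈ 0.19) is nearly as informative as distance
- Game context features such as time remaining and quarter show only weak linear correlations with success (|r| < 0.02)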
- Team features generally have less impact than player and spatial features
- The combination of shot distance and prior shooting percentage provides strong predictive power
- Player experience and career stage show meaningful correlations with shooting success
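Two issues are visible in the correlation table: `points` is derived from the target, and NumPy warned about a divide by a zero standard deviation. One way to screen for both before ranking features is sketched below. The helper name `rank_features` and the toy frame are ours; the real `shots` table would need its own exclusion list.

```python
import pandas as pd


def rank_features(df: pd.DataFrame, target: str, exclude=()) -> pd.Series:
    """Rank numeric features by absolute correlation with the target."""
    numeric = df.select_dtypes(include="number").drop(
        columns=[target, *exclude], errors="ignore"
    )
    # Drop constant columns: their standard deviation is zero, so correlation is undefined.
    numeric = numeric.loc[:, numeric.nunique(dropna=True) > 1]
    correlations = numeric.corrwith(df[target])
    return correlations.sort_values(key=abs, ascending=False)


# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    "shot_made": [1, 0, 1, 0, 1, 0],
    "shot_distance": [2, 24, 5, 18, 3, 26],
    "points": [2, 0, 2, 0, 2, 0],          # derived from the target: leakage
    "constant_flag": [1, 1, 1, 1, 1, 1],   # zero variance
})

ranking = rank_features(toy, "shot_made", exclude=["points"])
print(ranking)
```

Duplicate columns that differ only in case (for example `MINS_LEFT` and `mins_left` in the table) could be handled the same way by adding one of each pair to `exclude`.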
Summary of Key Insights¶
Our data exploration has revealed several important patterns that will guide our modeling approach:
Spatial factors have the strongest influence on shot success, with distance from basket being the most important predictor.
- Shot success decreases significantly with distance from the basket
- Corner three-pointers are more efficient than other three-point shots
- The restricted area has the highest success rate (~60-65%)
- When considering expected points per shot, three-point shots and shots at the rim provide the highest value
Player-specific factors like historical shooting percentages and career stage significantly impact shot outcomes.
- A player's prior shooting percentage in a zone is highly predictive of future success
- Players in their prime years (6-9 seasons) tend to have the highest shooting efficiency
- Different players show distinct shooting profiles across court zones
Game context creates meaningful variations in shot success, with pressure situations generally reducing efficiency.
- Teams shoot better when leading than when trailing
- Home teams shoot slightly better than away teams
- Shot success decreases in high-pressure situations (close games, late shot clock)
Temporal patterns show that shooting efficiency varies throughout games, with late-game situations presenting unique challenges.
- Shot success decreases slightly in the final minutes of quarters
- Fourth quarter and overtime shots have lower success rates than earlier quarters
- End-of-period shots have significantly lower success rates
These insights suggest that our modeling approach should:
- Incorporate spatial features as primary predictors
- Include player-specific historical performance metrics
- Account for game context and pressure situations
- Consider interactions between these feature groups
Next, we'll begin building our spatial model as the foundation of our shot prediction system.
DeepShot: Spatial Model¶
Introduction¶
In this notebook, we build our first predictive model: a spatial model that predicts shot success from court location. As our exploratory data analysis showed, spatial factors (particularly shot distance) have the strongest relationship with shot success, making this a logical starting point for our modeling efforts.
We'll use a Convolutional Neural Network (CNN) for this task. CNNs are typically used for image processing, but they can be adapted to spatial coordinates on the basketball court, where they can learn spatial patterns that simpler features like distance from the basket might miss.
Our modeling process will include:
- Data Preparation: Preparing and normalizing spatial features
- Model Architecture: Designing a CNN architecture for spatial prediction
- Training: Training the model with appropriate regularization
- Evaluation: Assessing model performance on test data
- Visualization: Creating interpretable visualizations of model predictions
- Analysis: Deriving insights from the spatial patterns learned by the model
This spatial model will serve as the foundation for our shot prediction system, which we'll later enhance with player-specific and game context features.
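A convolutional layer expects grid-shaped input, so one common way to adapt a CNN to shot coordinates is to bin shots into a court-shaped grid. The sketch below shows that idea with NumPy only. It is an illustration under stated assumptions, not the architecture used in this notebook: the grid size, the two-channel layout, and the court extent (tenths of a foot, hoop at the origin) are all our choices.

```python
import numpy as np


def rasterize_shots(loc_x, loc_y, made, bins=(25, 24)):
    """Bin shots into a court grid with two channels: attempts and makes per cell."""
    # Assumed half-court extent in tenths of a foot, with the hoop at (0, 0).
    x_range, y_range = (-250.0, 250.0), (-47.5, 422.5)
    attempts, _, _ = np.histogram2d(loc_x, loc_y, bins=bins, range=[x_range, y_range])
    makes, _, _ = np.histogram2d(
        loc_x, loc_y, bins=bins, range=[x_range, y_range], weights=made
    )
    return np.stack([attempts, makes], axis=-1)


# Synthetic shots for illustration only.
rng = np.random.default_rng(0)
toy_x = rng.uniform(-250, 250, size=1000)
toy_y = rng.uniform(-47.5, 422.5, size=1000)
toy_made = rng.integers(0, 2, size=1000)

grid = rasterize_shots(toy_x, toy_y, toy_made)
print(grid.shape)  # (x bins, y bins, channels)
```

Each cell here covers about 2 ft by 2 ft; a finer grid preserves more spatial detail at the cost of sparser counts per cell.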
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import seaborn as sns
from pathlib import Path
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models, optimizers, callbacks
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc, precision_recall_curve, average_precision_score, confusion_matrix, classification_report
# Set visualization style
sns.set_theme(style='whitegrid')
plt.rcParams['figure.figsize'] = [10, 6]
# Create directories
processed_dir = Path('../data/processed')
features_dir = processed_dir / 'features'
models_dir = Path('../models')
spatial_model_dir = models_dir / 'spatial_model'
for directory in [processed_dir, features_dir, models_dir, spatial_model_dir]:
    directory.mkdir(parents=True, exist_ok=True)
# Check TensorFlow version and GPU availability
print(f"TensorFlow version: {tf.__version__}")
print(f"GPU available: {tf.config.list_physical_devices('GPU')}")
TensorFlow version: 2.15.0
GPU available: []
Court Visualization¶
To interpret our spatial data and model predictions, we need a function to visualize the basketball court. This visualization will help us understand the spatial patterns in our data and the predictions made by our model.
def draw_court(ax=None, color='black', lw=2, outer_lines=False):
    """Draw an NBA half court on ax. Units are tenths of a foot, with the hoop at (0, 0)."""
    # If an axes object isn't provided to plot onto, just get the current one
    if ax is None:
        ax = plt.gca()
    # Create the basketball hoop (9 in. radius = 7.5 units)
    hoop = patches.Circle((0, 0), radius=7.5, linewidth=lw, color=color, fill=False)
    # Create backboard
    backboard = patches.Rectangle((-30, -7.5), 60, 0, linewidth=lw, color=color)
    # The paint
    # Create the outer box of the paint, width=16ft, height=19ft
    outer_box = patches.Rectangle((-80, -47.5), 160, 190, linewidth=lw, color=color, fill=False)
    # Create the inner box of the paint, width=12ft, height=19ft
    inner_box = patches.Rectangle((-60, -47.5), 120, 190, linewidth=lw, color=color, fill=False)
    # Create free throw top arc
    top_free_throw = patches.Arc((0, 142.5), 120, 120, theta1=0, theta2=180, linewidth=lw, color=color, fill=False)
    # Create free throw bottom arc
    bottom_free_throw = patches.Arc((0, 142.5), 120, 120, theta1=180, theta2=0, linewidth=lw, color=color, linestyle='dashed')
    # Restricted zone: an arc with a 4ft radius from the center of the hoop
    restricted = patches.Arc((0, 0), 80, 80, theta1=0, theta2=180, linewidth=lw, color=color)
    # Three point line
    # Create the side 3pt lines, they are 14ft long before they begin to arc
    corner_three_a = patches.Rectangle((-220, -47.5), 0, 140, linewidth=lw, color=color)
    corner_three_b = patches.Rectangle((220, -47.5), 0, 140, linewidth=lw, color=color)
    # 3pt arc: centered on the hoop, 23'9" away from it
    three_arc = patches.Arc((0, 0), 475, 475, theta1=22, theta2=158, linewidth=lw, color=color)
    # Center court
    center_outer_arc = patches.Arc((0, 422.5), 120, 120, theta1=180, theta2=0, linewidth=lw, color=color)
    center_inner_arc = patches.Arc((0, 422.5), 40, 40, theta1=180, theta2=0, linewidth=lw, color=color)
    # List of the court elements to be plotted onto the axes
    court_elements = [hoop, backboard, outer_box, inner_box, top_free_throw, bottom_free_throw, restricted,
                      corner_three_a, corner_three_b, three_arc, center_outer_arc, center_inner_arc]
    if outer_lines:
        # Draw the half court line, baseline and side out-of-bounds lines
        # (named court_border so it does not shadow the outer_lines flag)
        court_border = patches.Rectangle((-250, -47.5), 500, 470, linewidth=lw, color=color, fill=False)
        court_elements.append(court_border)
    # Add the court elements onto the axes
    for element in court_elements:
        ax.add_patch(element)
    return ax
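The numbers in draw_court follow from the comments in the code: a 16 ft paint is drawn 160 units wide, so one plotting unit is a tenth of a foot. As a quick sanity check, here is a small sketch that converts real court measurements into those units. The helper name feet_to_units is our own and is not part of the notebook.

```python
def feet_to_units(feet, inches=0.0):
    """Convert a length in feet and inches to court plotting units (tenths of a foot)."""
    return (feet + inches / 12.0) * 10.0

# Three-point arc: 23 ft 9 in from the hoop
three_point_radius = feet_to_units(23, 9)
# Restricted area: 4 ft radius
restricted_radius = feet_to_units(4)
# Hoop: 18 in diameter, so a 9 in radius
hoop_radius = feet_to_units(0, 9)

print(three_point_radius, restricted_radius, hoop_radius)
```

Note that matplotlib's Arc takes a full width and height rather than a radius, which is why the code above passes 475 and 80 for the three-point arc and the restricted area.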
Data Preparation¶
# Load shot data with features
shots = pd.read_csv(features_dir / 'shots_with_features.csv')
print(f"Loaded {len(shots)} shots")
# Define target variable and spatial features
target = 'shot_made'
spatial_features = ['loc_x', 'loc_y', 'shot_distance', 'shot_angle']
# Prepare data
model_data = shots[spatial_features + [target]].copy()
model_data = model_data.dropna()
# Split features and target
X = model_data.drop(target, axis=1)
y = model_data[target]
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training data: {X_train.shape}")
print(f"Testing data: {X_test.shape}")
Loaded 4650091 shots
Training data: (3720072, 4)
Testing data: (930019, 4)
# Normalize spatial coordinates
def normalize_coordinates(data):
    data_norm = data.copy()
    # Normalize x and y coordinates
    x_min, x_max = -250, 250
    y_min, y_max = -50, 450
    data_norm['loc_x_norm'] = (data_norm['loc_x'] - x_min) / (x_max - x_min)
    data_norm['loc_y_norm'] = (data_norm['loc_y'] - y_min) / (y_max - y_min)
    # Normalize shot distance
    max_distance = 500  # Maximum possible distance on court
    data_norm['shot_distance_norm'] = data_norm['shot_distance'] / max_distance
    # Normalize shot angle
    data_norm['shot_angle_norm'] = (data_norm['shot_angle'] + 180) / 360
    return data_norm
# Normalize training and testing data
X_train_norm = normalize_coordinates(X_train)
X_test_norm = normalize_coordinates(X_test)
# Prepare input data for CNN
feature_cols = ['loc_x_norm', 'loc_y_norm', 'shot_distance_norm', 'shot_angle_norm']
X_train_cnn = X_train_norm[feature_cols].values
X_test_cnn = X_test_norm[feature_cols].values
print(f"CNN input shape: {X_train_cnn.shape}")
CNN input shape: (3720072, 4)
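The normalization above is plain min-max scaling with fixed court bounds. Here is a minimal sketch of the same arithmetic on a single point, together with its inverse; the helper names min_max and inverse_min_max are ours, introduced only for illustration.

```python
X_MIN, X_MAX = -250, 250
Y_MIN, Y_MAX = -50, 450

def min_max(value, low, high):
    """Scale value from [low, high] to [0, 1]."""
    return (value - low) / (high - low)

def inverse_min_max(scaled, low, high):
    """Map a value in [0, 1] back to [low, high]."""
    return scaled * (high - low) + low

# The hoop sits at (0, 0) in court coordinates
hoop_x_norm = min_max(0, X_MIN, X_MAX)
hoop_y_norm = min_max(0, Y_MIN, Y_MAX)
# A shot angle of 0 degrees maps to the middle of the [0, 1] range
angle_norm = min_max(0, -180, 180)

print(hoop_x_norm, hoop_y_norm, angle_norm)
```

Because the bounds are constants rather than statistics of the training set, the same function can be applied to training data, test data, and the prediction grid without leaking information between them.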
Visualize Shot Distribution¶
# Create a sample of shots for visualization
sample_size = min(10000, len(X_train))
X_train_sample = X_train.sample(sample_size, random_state=42)
y_train_sample = y_train.loc[X_train_sample.index]
# Plot shot distribution
plt.figure(figsize=(10, 9))
ax = plt.gca()
draw_court(ax, outer_lines=True)
# Plot shots (made shots in green, missed in red)
made_shots = X_train_sample[y_train_sample == 1]
missed_shots = X_train_sample[y_train_sample == 0]
plt.scatter(made_shots['loc_x'], made_shots['loc_y'], c='green', alpha=0.3, s=10, label='Made')
plt.scatter(missed_shots['loc_x'], missed_shots['loc_y'], c='red', alpha=0.3, s=10, label='Missed')
plt.legend(loc='upper right')
plt.title('Shot Distribution in Training Data', fontsize=14)
plt.xlim(-250, 250)
plt.ylim(-50, 450)
plt.axis('off')
plt.tight_layout()
plt.show()
Model Architecture and Training¶
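We'll build our spatial model from Keras convolutional layers. Convolutional Neural Networks (CNNs) are well-suited to capturing spatial patterns when the input is a grid, because they learn local features and their spatial relationships. Note, however, that each shot here is a vector of four normalized features reshaped to a 1×1 grid with four channels, so the 1×1 convolutions act as fully connected layers on that vector. In effect, the architecture is a multilayer perceptron that processes the normalized spatial features and outputs a probability of shot success.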
# Define CNN model
def build_spatial_cnn_model(input_shape):
    model = models.Sequential([
        # Input layer
        layers.Input(shape=input_shape),
        # Reshape each feature vector to a 1x1 grid with one channel per feature
        layers.Reshape((1, 1, input_shape[0])),
        # Convolutional layers (1x1 kernels on a 1x1 grid behave like dense layers)
        layers.Conv2D(64, kernel_size=1, activation='relu'),
        layers.Conv2D(128, kernel_size=1, activation='relu'),
        layers.Conv2D(256, kernel_size=1, activation='relu'),
        # Flatten layer
        layers.Flatten(),
        # Dense layers
        layers.Dense(128, activation='relu'),
        layers.Dropout(0.3),
        layers.Dense(64, activation='relu'),
        layers.Dropout(0.3),
        # Output layer
        layers.Dense(1, activation='sigmoid')
    ])
    return model
# Build model
input_shape = (X_train_cnn.shape[1],)
model = build_spatial_cnn_model(input_shape)
# Compile model
model.compile(
optimizer=optimizers.Adam(learning_rate=0.001),
loss='binary_crossentropy',
metrics=['accuracy']
)
# Display model summary
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
reshape (Reshape) (None, 1, 1, 4) 0
conv2d (Conv2D) (None, 1, 1, 64) 320
conv2d_1 (Conv2D) (None, 1, 1, 128) 8320
conv2d_2 (Conv2D) (None, 1, 1, 256) 33024
flatten (Flatten) (None, 256) 0
dense (Dense) (None, 128) 32896
dropout (Dropout) (None, 128) 0
dense_1 (Dense) (None, 64) 8256
dropout_1 (Dropout) (None, 64) 0
dense_2 (Dense) (None, 1) 65
=================================================================
Total params: 82881 (323.75 KB)
Trainable params: 82881 (323.75 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
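The parameter counts in the summary can be reproduced by hand, which is a useful check that the architecture is what we think it is. The sketch below uses two small helper functions of our own (dense_params and conv2d_params) that count weights plus biases.

```python
def dense_params(n_in, n_out):
    """Weights plus biases for a fully connected layer."""
    return n_in * n_out + n_out

def conv2d_params(kernel_h, kernel_w, channels_in, filters):
    """Weights plus biases for a Conv2D layer."""
    return kernel_h * kernel_w * channels_in * filters + filters

layer_params = [
    conv2d_params(1, 1, 4, 64),     # conv2d
    conv2d_params(1, 1, 64, 128),   # conv2d_1
    conv2d_params(1, 1, 128, 256),  # conv2d_2
    dense_params(256, 128),         # dense
    dense_params(128, 64),          # dense_1
    dense_params(64, 1),            # dense_2
]
total_params = sum(layer_params)

print(layer_params, total_params)
```

With a 1×1 kernel, conv2d_params(1, 1, n_in, n_out) equals dense_params(n_in, n_out), so each Conv2D layer in this model has exactly as many parameters as a dense layer of the same width.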
# Define callbacks
early_stopping = callbacks.EarlyStopping(
monitor='val_loss',
patience=10,
restore_best_weights=True
)
reduce_lr = callbacks.ReduceLROnPlateau(
monitor='val_loss',
factor=0.2,
patience=5,
min_lr=0.0001
)
model_checkpoint = callbacks.ModelCheckpoint(
filepath=str(spatial_model_dir / 'spatial_model_best.keras'),
monitor='val_loss',
save_best_only=True,
verbose=1
)
# Train model
history = model.fit(
X_train_cnn, y_train,
epochs=30, # Reduced for faster training
batch_size=128,
validation_split=0.2,
callbacks=[early_stopping, reduce_lr, model_checkpoint],
verbose=1
)
Epoch 1/30 - 100s 4ms/step - loss: 0.6677 - accuracy: 0.6062 - val_loss: 0.6652 - val_accuracy: 0.6106 - lr: 0.0010 (val_loss improved from inf to 0.66515, saving model)
Epoch 2/30 - 98s 4ms/step - loss: 0.6661 - accuracy: 0.6093 - val_loss: 0.6642 - val_accuracy: 0.6097 - lr: 0.0010 (val_loss improved from 0.66515 to 0.66425, saving model)
Epoch 3/30 - 96s 4ms/step - loss: 0.6657 - accuracy: 0.6102 - val_loss: 0.6646 - val_accuracy: 0.6104 - lr: 0.0010 (val_loss did not improve from 0.66425)
Epoch 4/30 - 115s 5ms/step - loss: 0.6655 - accuracy: 0.6104 - val_loss: 0.6643 - val_accuracy: 0.6117 - lr: 0.0010 (val_loss did not improve from 0.66425)
Epoch 5/30 - 93s 4ms/step - loss: 0.6655 - accuracy: 0.6106 - val_loss: 0.6643 - val_accuracy: 0.6105 - lr: 0.0010 (val_loss did not improve from 0.66425)
Epoch 6/30 - 103s 4ms/step - loss: 0.6653 - accuracy: 0.6110 - val_loss: 0.6650 - val_accuracy: 0.6103 - lr: 0.0010 (val_loss did not improve from 0.66425)
Epoch 7/30 - 91s 4ms/step - loss: 0.6653 - accuracy: 0.6111 - val_loss: 0.6654 - val_accuracy: 0.6098 - lr: 0.0010 (val_loss did not improve from 0.66425)
Epoch 8/30 - 86s 4ms/step - loss: 0.6640 - accuracy: 0.6128 - val_loss: 0.6631 - val_accuracy: 0.6133 - lr: 2.0000e-04 (val_loss improved from 0.66425 to 0.66308, saving model)
Epoch 9/30 - 89s 4ms/step - loss: 0.6639 - accuracy: 0.6130 - val_loss: 0.6632 - val_accuracy: 0.6135 - lr: 2.0000e-04 (val_loss did not improve from 0.66308)
Epoch 10/30 - 81s 3ms/step - loss: 0.6638 - accuracy: 0.6131 - val_loss: 0.6630 - val_accuracy: 0.6130 - lr: 2.0000e-04 (val_loss improved from 0.66308 to 0.66304, saving model)
Epoch 11/30 - 110s 5ms/step - loss: 0.6638 - accuracy: 0.6132 - val_loss: 0.6629 - val_accuracy: 0.6133 - lr: 2.0000e-04 (val_loss improved from 0.66304 to 0.66286, saving model)
Epoch 12/30 - 112s 5ms/step - loss: 0.6638 - accuracy: 0.6132 - val_loss: 0.6628 - val_accuracy: 0.6134 - lr: 2.0000e-04 (val_loss improved from 0.66286 to 0.66279, saving model)
Epoch 13/30 - 99s 4ms/step - loss: 0.6638 - accuracy: 0.6133 - val_loss: 0.6631 - val_accuracy: 0.6128 - lr: 2.0000e-04 (val_loss did not improve from 0.66279)
Epoch 14/30 - 80s 3ms/step - loss: 0.6637 - accuracy: 0.6132 - val_loss: 0.6630 - val_accuracy: 0.6129 - lr: 2.0000e-04 (val_loss did not improve from 0.66279)
Epoch 15/30 - 92s 4ms/step - loss: 0.6637 - accuracy: 0.6132 - val_loss: 0.6635 - val_accuracy: 0.6132 - lr: 2.0000e-04 (val_loss did not improve from 0.66279)
Epoch 16/30 - 103s 4ms/step - loss: 0.6637 - accuracy: 0.6133 - val_loss: 0.6628 - val_accuracy: 0.6136 - lr: 2.0000e-04 (val_loss did not improve from 0.66279)
Epoch 17/30 - 100s 4ms/step - loss: 0.6634 - accuracy: 0.6134 - val_loss: 0.6626 - val_accuracy: 0.6137 - lr: 1.0000e-04 (val_loss improved from 0.66279 to 0.66263, saving model)
Epoch 18/30 - 112s 5ms/step - loss: 0.6634 - accuracy: 0.6135 - val_loss: 0.6633 - val_accuracy: 0.6127 - lr: 1.0000e-04 (val_loss did not improve from 0.66263)
Epoch 19/30 - 133s 6ms/step - loss: 0.6633 - accuracy: 0.6135 - val_loss: 0.6627 - val_accuracy: 0.6135 - lr: 1.0000e-04 (val_loss did not improve from 0.66263)
Epoch 20/30 - 105s 5ms/step - loss: 0.6634 - accuracy: 0.6135 - val_loss: 0.6630 - val_accuracy: 0.6131 - lr: 1.0000e-04 (val_loss did not improve from 0.66263)
Epoch 21/30 - 113s 5ms/step - loss: 0.6634 - accuracy: 0.6134 - val_loss: 0.6628 - val_accuracy: 0.6135 - lr: 1.0000e-04 (val_loss did not improve from 0.66263)
Epoch 22/30 - 115s 5ms/step - loss: 0.6634 - accuracy: 0.6135 - val_loss: 0.6629 - val_accuracy: 0.6137 - lr: 1.0000e-04 (val_loss did not improve from 0.66263)
Epoch 23/30 - 108s 5ms/step - loss: 0.6634 - accuracy: 0.6135 - val_loss: 0.6627 - val_accuracy: 0.6136 - lr: 1.0000e-04 (val_loss did not improve from 0.66263)
Epoch 24/30 - 128s 6ms/step - loss: 0.6633 - accuracy: 0.6135 - val_loss: 0.6627 - val_accuracy: 0.6137 - lr: 1.0000e-04 (val_loss did not improve from 0.66263)
Epoch 25/30 - 121s 5ms/step - loss: 0.6633 - accuracy: 0.6135 - val_loss: 0.6627 - val_accuracy: 0.6134 - lr: 1.0000e-04 (val_loss did not improve from 0.66263)
Epoch 26/30 - 99s 4ms/step - loss: 0.6633 - accuracy: 0.6135 - val_loss: 0.6628 - val_accuracy: 0.6133 - lr: 1.0000e-04 (val_loss did not improve from 0.66263)
Epoch 27/30 - 89s 4ms/step - loss: 0.6633 - accuracy: 0.6135 - val_loss: 0.6626 - val_accuracy: 0.6134 - lr: 1.0000e-04 (val_loss did not improve from 0.66263)
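Two things stand out in the training log: the learning rate drops from 0.0010 to 2e-04 at epoch 8 and to 1e-04 at epoch 17, and training ends after epoch 27, ten epochs after the last improvement in val_loss. Both follow from the callback settings. The sketch below is a simplified illustration of that logic, not the Keras implementation (it ignores options such as min_delta and cooldown); the function names and the toy loss values are our own.

```python
def reduced_lr(lr, factor=0.2, min_lr=1e-4):
    """Learning rate after one plateau reduction, clipped at min_lr."""
    return max(lr * factor, min_lr)

def early_stop_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

lr0 = 1e-3
lr1 = reduced_lr(lr0)  # 0.001 * 0.2
lr2 = reduced_lr(lr1)  # 0.0002 * 0.2 would be 4e-05, so min_lr takes over

# Hypothetical validation losses, not taken from the training log
toy_losses = [0.70, 0.68, 0.69, 0.69, 0.70]
stop = early_stop_epoch(toy_losses, patience=3)

print(lr1, lr2, stop)
```

Since min_lr is 0.0001, the second reduction is already clipped, and no further learning rate changes are possible after epoch 17.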
Model Evaluation¶
After training our model, we need to evaluate its performance on held-out test data. This evaluation will help us understand how well our model generalizes to new shots and identify any issues like overfitting.
# Plot training history
plt.figure(figsize=(12, 5))
# Plot loss
plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
# Plot accuracy
plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.tight_layout()
plt.show()
# Evaluate model on test data
test_loss, test_accuracy = model.evaluate(X_test_cnn, y_test)
print(f"Test Loss: {test_loss:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")
# Generate predictions
y_pred_prob = model.predict(X_test_cnn)
y_pred = (y_pred_prob > 0.5).astype(int).flatten()
# Calculate metrics
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.tight_layout()
plt.show()
29064/29064 [==============================] - 32s 1ms/step - loss: 0.6630 - accuracy: 0.6131
Test Loss: 0.6630
Test Accuracy: 0.6131
29064/29064 [==============================] - 31s 1ms/step
Classification Report:
precision recall f1-score support
False 0.61 0.79 0.69 505457
True 0.62 0.40 0.49 424562
accuracy 0.61 930019
macro avg 0.61 0.60 0.59 930019
weighted avg 0.61 0.61 0.60 930019
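To put the accuracy in context, it helps to compare it with the simplest possible baseline: always predicting the majority class. The sketch below uses only the support counts and test accuracy reported above; the variable names are ours.

```python
# Support counts copied from the classification report above
n_missed = 505457
n_made = 424562
n_total = n_missed + n_made

# Accuracy of always predicting the majority class (a miss)
baseline_accuracy = max(n_missed, n_made) / n_total
model_accuracy = 0.6131  # test accuracy reported above
lift = model_accuracy - baseline_accuracy

print(f"Baseline: {baseline_accuracy:.4f}, model: {model_accuracy:.4f}, lift: {lift:.4f}")
```

The spatial features buy roughly seven percentage points over the baseline. The recall of 0.40 for made shots also shows that, at a 0.5 threshold, the model predicts a miss for most attempts.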
Prediction Visualization¶
To understand what our model has learned about spatial patterns in basketball shooting, we'll create visualizations of its predictions across the court. These visualizations will help us interpret the model and derive strategic insights.
# Create a grid of court locations
grid_size = 50
x_min, x_max = -250, 250
y_min, y_max = -50, 450
x_grid = np.linspace(x_min, x_max, grid_size)
y_grid = np.linspace(y_min, y_max, grid_size)
X_grid, Y_grid = np.meshgrid(x_grid, y_grid)
# Create input data for each grid point
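grid_points = []
for i in range(grid_size):
    for j in range(grid_size):
        x = X_grid[i, j]
        y = Y_grid[i, j]
        # Distance and angle are derived from court coordinates (tenths of a foot).
        # They must use the same units and conventions as the shot_distance and
        # shot_angle columns the model was trained on.
        distance = np.sqrt(x**2 + y**2)
        angle = np.arctan2(x, y) * 180 / np.pi
        grid_points.append([x, y, distance, angle])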
grid_df = pd.DataFrame(grid_points, columns=spatial_features)
grid_norm = normalize_coordinates(grid_df)
grid_input = grid_norm[feature_cols].values
# Generate predictions for grid points
grid_pred = model.predict(grid_input).reshape(grid_size, grid_size)
# Plot prediction heatmap
plt.figure(figsize=(10, 9))
ax = plt.gca()
draw_court(ax, outer_lines=True)
# Plot heatmap
plt.imshow(grid_pred, origin='lower', extent=[x_min, x_max, y_min, y_max],
cmap='RdYlGn', vmin=0, vmax=1, alpha=0.7)
plt.colorbar(label='Predicted Shot Success Probability')
plt.title('Predicted Shot Success Probability by Court Location', fontsize=14)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.axis('off')
plt.tight_layout()
plt.show()
79/79 [==============================] - 0s 906us/step
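The nested loops above are easy to read, but the same grid can be built without Python loops. The sketch below is an alternative under the assumption that NumPy's default row-major ravel() order is acceptable, which it is here because it visits points in the same order as the loops; that is also why reshaping the predictions to (grid_size, grid_size) lines up with X_grid and Y_grid. A small grid is used for illustration.

```python
import numpy as np

grid_size = 5  # small grid for illustration; the notebook uses 50
x_grid = np.linspace(-250, 250, grid_size)
y_grid = np.linspace(-50, 450, grid_size)
X_grid, Y_grid = np.meshgrid(x_grid, y_grid)

# Loop version, as in the notebook
loop_points = []
for i in range(grid_size):
    for j in range(grid_size):
        x, y = X_grid[i, j], Y_grid[i, j]
        loop_points.append([x, y, np.sqrt(x**2 + y**2), np.arctan2(x, y) * 180 / np.pi])
loop_points = np.array(loop_points)

# Vectorized version: ravel() flattens in the same row-major order as the loops
xs, ys = X_grid.ravel(), Y_grid.ravel()
vec_points = np.column_stack([xs, ys, np.hypot(xs, ys), np.degrees(np.arctan2(xs, ys))])

print(np.allclose(loop_points, vec_points))
```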
Expected Points Analysis¶
While shot success probability is important, coaches and players are ultimately concerned with maximizing points. By multiplying our predicted probabilities by the point value of each shot (2 or 3), we can create an expected points map that provides more actionable insights for shot selection.
# Calculate expected points for each grid point
def calculate_expected_points(x, y, pred_prob):
    # Determine if shot is a 3-pointer
    distance_from_basket = np.sqrt(x**2 + y**2)
    is_corner_three = (abs(x) > 220) and (y < 140)
    is_three_pointer = (distance_from_basket > 237.5) or is_corner_three
    # Calculate expected points
    points = 3 if is_three_pointer else 2
    expected_points = points * pred_prob
    return expected_points, is_three_pointer
# Calculate expected points for grid
expected_points_grid = np.zeros((grid_size, grid_size))
shot_type_grid = np.zeros((grid_size, grid_size))
for i in range(grid_size):
    for j in range(grid_size):
        x = X_grid[i, j]
        y = Y_grid[i, j]
        pred_prob = grid_pred[i, j]
        expected_points, is_three_pointer = calculate_expected_points(x, y, pred_prob)
        expected_points_grid[i, j] = expected_points
        shot_type_grid[i, j] = 3 if is_three_pointer else 2
# Plot expected points heatmap
plt.figure(figsize=(10, 9))
ax = plt.gca()
draw_court(ax, outer_lines=True)
# Plot heatmap
plt.imshow(expected_points_grid, origin='lower', extent=[x_min, x_max, y_min, y_max],
cmap='viridis', vmin=0, vmax=1.5, alpha=0.7)
plt.colorbar(label='Expected Points per Shot')
plt.title('Expected Points per Shot by Court Location', fontsize=14)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.axis('off')
plt.tight_layout()
plt.show()
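The expected points rule is simply point value times make probability, so it can also be applied to whole arrays at once. The sketch below is a vectorized variant of calculate_expected_points using the same thresholds; the function name and the three sample locations and probabilities are hypothetical and chosen only to exercise each branch.

```python
import numpy as np

def expected_points_array(X, Y, prob):
    """Vectorized expected points: point value times make probability."""
    distance = np.hypot(X, Y)
    is_corner_three = (np.abs(X) > 220) & (Y < 140)
    is_three = (distance > 237.5) | is_corner_three
    points = np.where(is_three, 3, 2)
    return points * prob, is_three

# Hypothetical inputs: near the rim, mid-range, corner three
X = np.array([0.0, 0.0, 230.0])
Y = np.array([10.0, 180.0, 0.0])
prob = np.array([0.60, 0.40, 0.38])

ep, is_three = expected_points_array(X, Y, prob)
print(ep, is_three)
```

The corner location is less than 237.5 units from the hoop, yet it still counts as a three-pointer because of the corner rule, which is exactly why that spot can be so valuable.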
Distance and Angle Analysis¶
# Analyze shot success by distance
distance_bins = [0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40]
distance_labels = ['0-4', '4-8', '8-12', '12-16', '16-20', '20-24', '24-28', '28-32', '32-36', '36-40']
# include_lowest=True keeps shots recorded at exactly 0 feet in the first bin
X_test['distance_bin'] = pd.cut(X_test['shot_distance'], bins=distance_bins, labels=distance_labels, include_lowest=True)
# Calculate actual success rate by distance
distance_success = pd.DataFrame({
    'distance_bin': X_test['distance_bin'],
    'actual': y_test,
    'predicted': y_pred_prob.flatten()
})
# observed=False keeps empty bins and makes the current pandas default explicit
distance_analysis = distance_success.groupby('distance_bin', observed=False).agg(
    actual_rate=('actual', 'mean'),
    predicted_rate=('predicted', 'mean'),
    count=('actual', 'count')
).reset_index()
# Plot distance analysis
plt.figure(figsize=(10, 6))
plt.plot(distance_analysis['distance_bin'], distance_analysis['actual_rate'], 'o-', label='Actual')
plt.plot(distance_analysis['distance_bin'], distance_analysis['predicted_rate'], 's--', label='Predicted')
plt.title('Shot Success Rate by Distance', fontsize=14)
plt.xlabel('Distance from Basket (feet)')
plt.ylabel('Success Rate')
plt.xticks(rotation=45)
plt.legend()
plt.tight_layout()
plt.show()
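One detail worth knowing when binning with pd.cut: by default its intervals are right-closed, so a value equal to the lowest edge (a shot recorded at exactly 0 feet) only lands in the first bin when include_lowest=True is passed. The plain-Python sketch below mimics that rule so the edge cases are easy to see; bin_label is a hypothetical helper of ours, not a pandas function.

```python
import bisect

def bin_label(distance, edges, include_lowest=False):
    """Return the label of the right-closed bin (lo, hi] containing distance, or None."""
    if include_lowest and distance == edges[0]:
        return f"{edges[0]}-{edges[1]}"
    if distance <= edges[0] or distance > edges[-1]:
        return None
    # bisect_left finds the first edge >= distance, which is the bin's upper edge
    i = bisect.bisect_left(edges, distance)
    return f"{edges[i - 1]}-{edges[i]}"

edges = [0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40]

print(bin_label(0, edges))                       # falls outside every bin
print(bin_label(0, edges, include_lowest=True))  # placed in the first bin
print(bin_label(4, edges), bin_label(4.5, edges), bin_label(41, edges))
```

Shots beyond 40 feet fall outside every bin as well, so they are dropped from the distance analysis.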
Save the Model¶
# Save model
model.save(spatial_model_dir / 'spatial_model_final.keras')
print(f"Model saved to {spatial_model_dir / 'spatial_model_final.keras'}")
Model saved to ../models/spatial_model/spatial_model_final.keras
Key Insights from Spatial Modeling¶
Our spatial modeling approach has yielded several important insights about basketball shooting:
Shot success probability decreases with distance from the basket, with a sharp decline beyond 16 feet. This confirms our intuition and exploratory analysis findings, but our model provides a more precise quantification of this relationship.
Expected points analysis reveals optimal shooting locations:
- The restricted area (close to the basket) offers the highest expected points per shot, with values often exceeding 1.2 points per attempt
- Corner three-pointers are more valuable than most mid-range shots despite being further from the basket, with expected values around 1.0-1.1 points per attempt
- Mid-range shots (16-22 feet) generally offer the lowest expected points per shot, often below 0.8 points per attempt
- This analysis provides quantitative support for the "three-point revolution" in the NBA
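The comparison between corner threes and mid-range shots comes down to simple arithmetic. The sketch below converts the expected-points values quoted above into the make probabilities they imply, and computes the two-point percentage needed to match a given three-point percentage. The function names are ours.

```python
def make_probability(expected_points, point_value):
    """Make probability implied by an expected-points value."""
    return expected_points / point_value

def break_even_two_point_pct(three_point_pct):
    """Two-point percentage with the same expected points as a given three-point percentage."""
    return three_point_pct * 3 / 2

corner_three_low = make_probability(1.0, 3)
corner_three_high = make_probability(1.1, 3)
mid_range_cap = make_probability(0.8, 2)
needed_mid_range = break_even_two_point_pct(corner_three_low)

print(f"Corner three make rate: {corner_three_low:.3f} to {corner_three_high:.3f}")
print(f"Mid-range make rate below: {mid_range_cap:.2f}")
print(f"Mid-range rate needed to match the weakest corner three: {needed_mid_range:.2f}")
```

In other words, a corner three made about a third of the time is worth as much as a two-pointer made half the time, while the mid-range values above correspond to make rates below 40%.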
Angle impacts shot success, with straight-on shots generally having higher success rates than angled shots at the same distance. This may be due to the simpler shooting mechanics and better depth perception for straight-on shots.
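The model achieves modest but meaningful predictive performance using only spatial features: test accuracy is about 61%, compared with roughly 54% for always predicting a miss (the majority class in the test set). This is reasonable considering the inherent randomness in basketball shooting and the limited feature set used.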
Visualization techniques provide interpretable insights that could be directly applied to basketball strategy. Our heatmaps of shot success probability and expected points offer clear guidance for shot selection optimization.
These insights demonstrate the value of our spatial modeling approach and provide a strong foundation for our shot prediction system. In the next notebook, we'll build a player embedding model to capture player-specific shooting patterns, which will complement our spatial model.
DeepShot: Player Embedding Model¶
Introduction¶
While our spatial model provides a strong foundation for shot prediction, it treats all players the same. In reality, different players have vastly different shooting abilities and tendencies - Stephen Curry and Shaquille O'Neal don't have the same shooting profile, even from identical court locations.
In this notebook, we develop a player embedding model to capture these player-specific shooting patterns. Player embeddings are a technique inspired by natural language processing, where we represent each player as a dense, continuous vector (32 dimensions in our model) that is learned during training. Players with similar shooting tendencies should end up close to each other in this embedding space.
Our approach involves:
- Creating a Player Dictionary: Assigning a unique ID to each player
- Building an Embedding Layer: Learning a vector representation for each player
- Combining with Spatial Features: Integrating player embeddings with court location data
- Training the Model: Learning player-specific shooting patterns
- Extracting Embeddings: Analyzing the learned player representations
- Visualizing Player Similarities: Mapping players in 2D space based on shooting tendencies
This player embedding approach allows us to capture latent shooting characteristics that aren't explicitly encoded in our features. These may reflect factors such as shooting form, release speed, and decision-making tendencies, although individual embedding dimensions are not directly interpretable. The resulting embeddings provide a mathematical representation of each player's "shooting DNA."
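To make the idea concrete, here is a toy sketch of what an embedding table is and how closeness between players can be measured. The three players and their 3-dimensional vectors are made up for illustration (the real model learns its vectors during training), and cosine similarity is one common choice of similarity measure, not necessarily the one used later in this series.

```python
import math

# Hypothetical 3-dimensional embeddings for three made-up players
embedding_table = [
    [0.9, 0.1, 0.0],  # player_id 0
    [0.8, 0.2, 0.1],  # player_id 1
    [0.0, 0.1, 0.9],  # player_id 2
]

def lookup(player_id):
    """An embedding layer is a table lookup: the ID selects one row."""
    return embedding_table[player_id]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_01 = cosine_similarity(lookup(0), lookup(1))
sim_02 = cosine_similarity(lookup(0), lookup(2))

print(f"Players 0 and 1: {sim_01:.3f}, players 0 and 2: {sim_02:.3f}")
```

Players 0 and 1 point in nearly the same direction, so they would count as similar shooters in this toy space, while player 2 would not.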
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models, optimizers, callbacks
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import roc_curve, auc, precision_recall_curve, average_precision_score, confusion_matrix, classification_report
# Set visualization style
sns.set_theme(style='whitegrid')
plt.rcParams['figure.figsize'] = [10, 6]
# Create directories
processed_dir = Path('../data/processed')
features_dir = processed_dir / 'features'
models_dir = Path('../models')
player_embedding_dir = models_dir / 'player_embedding'
for directory in [processed_dir, features_dir, models_dir, player_embedding_dir]:
    directory.mkdir(parents=True, exist_ok=True)
# Check TensorFlow version and GPU availability
print(f"TensorFlow version: {tf.__version__}")
print(f"GPU available: {tf.config.list_physical_devices('GPU')}")
TensorFlow version: 2.15.0
GPU available: []
Data Preparation¶
Before building our embedding model, we need to prepare our data appropriately. This includes creating a player dictionary that maps player names to unique IDs, selecting relevant features, and splitting the data for training and evaluation.
# Load shot data with features
shots = pd.read_csv(features_dir / 'shots_with_features.csv')
print(f"Loaded {len(shots)} shots")
# Create player dictionary
player_encoder = LabelEncoder()
player_ids = player_encoder.fit_transform(shots['player_name'].unique())
player_names = shots['player_name'].unique()
player_dict = dict(zip(player_names, player_ids))
id_to_player_dict = dict(zip(player_ids, player_names))
print(f"Created player dictionary with {len(player_dict)} players")
# Add player IDs to shot features
shots['player_id'] = shots['player_name'].map(player_dict)
shots['player_id'] = shots['player_id'].astype(int)
# Save player dictionary for later use
player_dict_df = pd.DataFrame({
'player_name': player_names,
'player_id': player_ids
})
player_dict_df.to_csv(processed_dir / 'player_dict.csv', index=False)
Loaded 4650091 shots
Created player dictionary with 2164 players
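LabelEncoder assigns integer IDs according to the sorted order of the labels, starting at 0. The plain-Python sketch below builds the same kind of mapping for a tiny, hypothetical shot log, which makes it easy to see why every ID falls in the range the embedding layer expects (0 up to the number of players minus one).

```python
# Hypothetical shot log; names repeat because each row is one shot
shot_player_names = ["STEPHEN CURRY", "LEBRON JAMES", "STEPHEN CURRY", "KEVIN DURANT"]

# Sorted unique names get consecutive integer IDs, starting at 0
unique_names = sorted(set(shot_player_names))
player_to_id = {name: idx for idx, name in enumerate(unique_names)}
id_to_player = {idx: name for name, idx in player_to_id.items()}

# Map every shot to its player's ID
shot_player_ids = [player_to_id[name] for name in shot_player_names]

print(player_to_id)
print(shot_player_ids)
```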
# Select relevant features for the embedding model
embedding_features = ['player_id', 'loc_x', 'loc_y', 'shot_distance', 'shot_angle', 'shot_made']
embedding_data = shots[embedding_features].copy()
# Drop rows with missing values
embedding_data = embedding_data.dropna()
# Define features and target
X = embedding_data.drop('shot_made', axis=1)
y = embedding_data['shot_made']
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Split training set into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
print(f"Training set: {X_train.shape[0]} samples")
print(f"Validation set: {X_val.shape[0]} samples")
print(f"Testing set: {X_test.shape[0]} samples")
Training set: 2976057 samples
Validation set: 744015 samples
Testing set: 930019 samples
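Applying a 20% split twice leaves 64% of the shots for training, 16% for validation, and 20% for testing. The sketch below reproduces the printed counts by hand. The rounding rule (round the held-out part up, give the remainder to the other part) is our assumption about how train_test_split handles fractional sizes; it matches the numbers above.

```python
import math

def split_sizes(n_samples, test_fraction):
    """Sizes of the (remaining, held-out) parts, rounding the held-out part up."""
    n_held_out = math.ceil(n_samples * test_fraction)
    return n_samples - n_held_out, n_held_out

n_total = 4650091
n_train_val, n_test = split_sizes(n_total, 0.2)
n_train, n_val = split_sizes(n_train_val, 0.2)

print(n_train, n_val, n_test)
print(f"Train share: {n_train / n_total:.2%}")
```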
Player Shot Pattern Analysis¶
To understand the variation in player shooting patterns, we'll analyze shot success rates and distributions across different players. This analysis will help us understand the player-specific patterns we're trying to capture with our embedding model.
# Calculate shot success rate by player
player_shot_stats = shots.groupby('player_name').agg(
total_shots=('shot_made', 'count'),
made_shots=('shot_made', 'sum'),
success_rate=('shot_made', 'mean')
).reset_index()
# Sort by total shots
player_shot_stats = player_shot_stats.sort_values('total_shots', ascending=False)
# Display top players by total shots
player_shot_stats.head(10)
| | player_name | total_shots | made_shots | success_rate |
|---|---|---|---|---|
| 1304 | LEBRON JAMES | 29311 | 14837 | 0.506192 |
| 283 | CARMELO ANTHONY | 24144 | 10803 | 0.447440 |
| 1789 | RUSSELL WESTBROOK | 21648 | 9486 | 0.438193 |
| 1214 | KEVIN DURANT | 20737 | 10433 | 0.503110 |
| 894 | JAMES HARDEN | 19039 | 8392 | 0.440779 |
| 617 | DWYANE WADE | 18297 | 8753 | 0.478384 |
| 571 | DIRK NOWITZKI | 18285 | 8637 | 0.472354 |
| 503 | DEMAR DEROZAN | 18006 | 8440 | 0.468733 |
| 1248 | KOBE BRYANT | 17869 | 7916 | 0.443002 |
| 2079 | VINCE CARTER | 17584 | 7568 | 0.430391 |
# Plot distribution of success rates
plt.figure(figsize=(10, 6))
sns.histplot(player_shot_stats['success_rate'], bins=30)
plt.title('Distribution of Player Shot Success Rates', fontsize=14)
plt.xlabel('Success Rate')
plt.ylabel('Count')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Create distance bins
distance_bins = [0, 5, 10, 15, 20, 25, 30, 35, 40]
distance_labels = [f"{distance_bins[i]}-{distance_bins[i+1]}" for i in range(len(distance_bins)-1)]
# Add distance bin column (include_lowest=True keeps shots recorded at exactly 0 feet)
shots['distance_bin'] = pd.cut(shots['shot_distance'], bins=distance_bins, labels=distance_labels, include_lowest=True)
# Calculate shot distribution by player and distance bin
# observed=False keeps empty bins and makes the current pandas default explicit
player_distance_dist = shots.groupby(['player_name', 'distance_bin'], observed=False).size().unstack(fill_value=0)
# Normalize by player
player_distance_dist = player_distance_dist.div(player_distance_dist.sum(axis=1), axis=0)
# Plot distance distributions for top players
top_players = player_shot_stats.head(5)['player_name'].tolist()
plt.figure(figsize=(12, 6))
for player in top_players:
    if player in player_distance_dist.index:
        plt.plot(player_distance_dist.columns, player_distance_dist.loc[player], marker='o', linewidth=2, label=player)
plt.title('Shot Distance Distribution for Top Players', fontsize=14)
plt.xlabel('Distance (feet)')
plt.ylabel('Proportion of Shots')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Embedding Model Architecture¶
Our player embedding model uses a neural network architecture with an embedding layer that learns a vector representation for each player. This embedding is then combined with spatial features to predict shot success. The architecture allows the model to learn player-specific shooting patterns while accounting for court location.
# Separate player IDs and other features
X_train_player_ids = X_train['player_id'].values.astype(int)
X_train_features = X_train.drop('player_id', axis=1).values
X_val_player_ids = X_val['player_id'].values.astype(int)
X_val_features = X_val.drop('player_id', axis=1).values
X_test_player_ids = X_test['player_id'].values.astype(int)
X_test_features = X_test.drop('player_id', axis=1).values
print(f"Player ID arrays shape: {X_train_player_ids.shape}, {X_val_player_ids.shape}, {X_test_player_ids.shape}")
print(f"Feature arrays shape: {X_train_features.shape}, {X_val_features.shape}, {X_test_features.shape}")
Player ID arrays shape: (2976057,), (744015,), (930019,)
Feature arrays shape: (2976057, 4), (744015, 4), (930019, 4)
# Define embedding dimension
embedding_dim = 32 # Adjust based on the number of players and complexity of patterns
num_players = len(player_dict)
# Define model parameters
hidden_units = [128, 64] # Hidden layer sizes
dropout_rate = 0.3 # Dropout rate for regularization
l2_reg = 0.001 # L2 regularization strength
# Define input layers
player_input = keras.Input(shape=(1,), name='player_input')
features_input = keras.Input(shape=(X_train_features.shape[1],), name='features_input')
# Player embedding layer
player_embedding = layers.Embedding(
    input_dim=num_players,
    output_dim=embedding_dim,
    embeddings_initializer='uniform',
    embeddings_regularizer=keras.regularizers.l2(l2_reg),
    name='player_embedding'
)(player_input)
player_embedding = layers.Flatten()(player_embedding)
# Combine player embedding with other features
combined = layers.Concatenate()([player_embedding, features_input])
# Hidden layers (each Dense layer is followed by Dropout)
x = combined
for i, units in enumerate(hidden_units):
    x = layers.Dense(
        units=units,
        activation='relu',
        kernel_regularizer=keras.regularizers.l2(l2_reg),
        name=f'hidden_{i+1}'
    )(x)
    x = layers.Dropout(dropout_rate)(x)
# Output layer
output = layers.Dense(
    units=1,
    activation='sigmoid',
    name='output'
)(x)
# Create model
model = keras.Model(inputs=[player_input, features_input], outputs=output)
# Compile model
model.compile(
    optimizer=optimizers.Adam(learning_rate=0.001),
    loss='binary_crossentropy',
    metrics=['accuracy']
)
# Display model summary
model.summary()
Model: "model"
| Layer (type) | Output Shape | Param # | Connected to |
|---|---|---|---|
| player_input (InputLayer) | [(None, 1)] | 0 | [] |
| player_embedding (Embedding) | (None, 1, 32) | 69248 | ['player_input[0][0]'] |
| flatten (Flatten) | (None, 32) | 0 | ['player_embedding[0][0]'] |
| features_input (InputLayer) | [(None, 4)] | 0 | [] |
| concatenate (Concatenate) | (None, 36) | 0 | ['flatten[0][0]', 'features_input[0][0]'] |
| hidden_1 (Dense) | (None, 128) | 4736 | ['concatenate[0][0]'] |
| dropout (Dropout) | (None, 128) | 0 | ['hidden_1[0][0]'] |
| hidden_2 (Dense) | (None, 64) | 8256 | ['dropout[0][0]'] |
| dropout_1 (Dropout) | (None, 64) | 0 | ['hidden_2[0][0]'] |
| output (Dense) | (None, 1) | 65 | ['dropout_1[0][0]'] |
Total params: 82305 (321.50 KB)
Trainable params: 82305 (321.50 KB)
Non-trainable params: 0 (0.00 Byte)
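The parameter counts in the summary above can be checked by hand. The sketch below recomputes them from the layer sizes; the player count of 2,164 is inferred from the summary (69,248 embedding parameters divided by 32 dimensions), and the helper name `dense_params` is our own.

```python
# Recompute the parameter counts reported by model.summary().
num_players = 2164      # inferred: 69248 embedding params / 32 dimensions
embedding_dim = 32
num_features = 4        # columns in X_train_features
hidden_units = [128, 64]

def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit."""
    return n_inputs * n_units + n_units

counts = {"player_embedding": num_players * embedding_dim}

n_inputs = embedding_dim + num_features   # width after Concatenate
for i, units in enumerate(hidden_units, start=1):
    counts[f"hidden_{i}"] = dense_params(n_inputs, units)
    n_inputs = units
counts["output"] = dense_params(n_inputs, 1)

total = sum(counts.values())
print(counts)
print(total)
```

Most of the parameters (69,248 of 82,305) sit in the embedding table, so the embedding regularizer acts on the bulk of the model's capacity.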
Model Training¶
We'll train our embedding model with a binary cross-entropy loss, which suits a make-or-miss prediction task. Dropout and L2 regularization help reduce overfitting, while early stopping, learning-rate reduction on plateau, and checkpointing on the best validation loss control the training run.
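As a reference for reading loss values, here is a minimal pure-Python sketch of binary cross-entropy. The function name, the clipping constant `eps`, and the toy labels and probabilities are assumptions for illustration; Keras computes the same quantity internally with its own numerical safeguards.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over paired labels and probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
uninformed = binary_cross_entropy(labels, [0.5, 0.5, 0.5, 0.5])
confident = binary_cross_entropy(labels, [0.9, 0.1, 0.8, 0.2])
print(round(uninformed, 4), round(confident, 4))
```

A model that always predicts 0.5 scores ln 2 (about 0.693), so losses only slightly below that value indicate a weak but real signal.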
# Define callbacks
early_stopping = callbacks.EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True
)
reduce_lr = callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.2,
    patience=5,
    min_lr=0.0001
)
model_checkpoint = callbacks.ModelCheckpoint(
    filepath=str(player_embedding_dir / 'player_embedding_best.keras'),
    monitor='val_loss',
    save_best_only=True,
    verbose=1
)
# Train model
history = model.fit(
    [X_train_player_ids, X_train_features], y_train,
    epochs=30,  # Reduced for faster training
    batch_size=128,
    validation_data=([X_val_player_ids, X_val_features], y_val),
    callbacks=[early_stopping, reduce_lr, model_checkpoint],
    verbose=1
)
Epoch 1/30 - 132s 6ms/step - loss: 0.6762 - accuracy: 0.6053 - val_loss: 0.6666 - val_accuracy: 0.6128 - lr: 0.0010 (val_loss improved from inf to 0.66655, saving model to ../models/player_embedding/player_embedding_best.keras)
Epoch 2/30 - 163s 7ms/step - loss: 0.6680 - accuracy: 0.6096 - val_loss: 0.6663 - val_accuracy: 0.6110 - lr: 0.0010 (val_loss improved from 0.66655 to 0.66628, saving model to ../models/player_embedding/player_embedding_best.keras)
Epoch 3/30 - 152s 7ms/step - loss: 0.6678 - accuracy: 0.6097 - val_loss: 0.6658 - val_accuracy: 0.6123 - lr: 0.0010 (val_loss improved from 0.66628 to 0.66582, saving model to ../models/player_embedding/player_embedding_best.keras)
Epoch 4/30 - 144s 6ms/step - loss: 0.6678 - accuracy: 0.6097 - val_loss: 0.6661 - val_accuracy: 0.6114 - lr: 0.0010 (val_loss did not improve from 0.66582)
Epoch 5/30 - 145s 6ms/step - loss: 0.6679 - accuracy: 0.6095 - val_loss: 0.6653 - val_accuracy: 0.6131 - lr: 0.0010 (val_loss improved from 0.66582 to 0.66528, saving model to ../models/player_embedding/player_embedding_best.keras)
Epoch 6/30 - 166s 7ms/step - loss: 0.6678 - accuracy: 0.6095 - val_loss: 0.6659 - val_accuracy: 0.6112 - lr: 0.0010 (val_loss did not improve from 0.66528)
Epoch 7/30 - 165s 7ms/step - loss: 0.6678 - accuracy: 0.6096 - val_loss: 0.6653 - val_accuracy: 0.6132 - lr: 0.0010 (val_loss did not improve from 0.66528)
Epoch 8/30 - 167s 7ms/step - loss: 0.6678 - accuracy: 0.6097 - val_loss: 0.6671 - val_accuracy: 0.6089 - lr: 0.0010 (val_loss did not improve from 0.66528)
Epoch 9/30 - 167s 7ms/step - loss: 0.6677 - accuracy: 0.6095 - val_loss: 0.6654 - val_accuracy: 0.6118 - lr: 0.0010 (val_loss did not improve from 0.66528)
Epoch 10/30 - 170s 7ms/step - loss: 0.6676 - accuracy: 0.6094 - val_loss: 0.6651 - val_accuracy: 0.6131 - lr: 0.0010 (val_loss improved from 0.66528 to 0.66515, saving model to ../models/player_embedding/player_embedding_best.keras)
Epoch 11/30 - 167s 7ms/step - loss: 0.6676 - accuracy: 0.6093 - val_loss: 0.6652 - val_accuracy: 0.6136 - lr: 0.0010 (val_loss did not improve from 0.66515)
Epoch 12/30 - 170s 7ms/step - loss: 0.6675 - accuracy: 0.6094 - val_loss: 0.6654 - val_accuracy: 0.6106 - lr: 0.0010 (val_loss did not improve from 0.66515)
Epoch 13/30 - 170s 7ms/step - loss: 0.6676 - accuracy: 0.6090 - val_loss: 0.6649 - val_accuracy: 0.6134 - lr: 0.0010 (val_loss improved from 0.66515 to 0.66490, saving model to ../models/player_embedding/player_embedding_best.keras)
Epoch 14/30 - 151s 6ms/step - loss: 0.6675 - accuracy: 0.6093 - val_loss: 0.6649 - val_accuracy: 0.6128 - lr: 0.0010 (val_loss did not improve from 0.66490)
Epoch 15/30 - 116s 5ms/step - loss: 0.6675 - accuracy: 0.6092 - val_loss: 0.6650 - val_accuracy: 0.6125 - lr: 0.0010 (val_loss did not improve from 0.66490)
Epoch 16/30 - 108s 5ms/step - loss: 0.6675 - accuracy: 0.6091 - val_loss: 0.6650 - val_accuracy: 0.6131 - lr: 0.0010 (val_loss did not improve from 0.66490)
Epoch 17/30 - 91s 4ms/step - loss: 0.6676 - accuracy: 0.6091 - val_loss: 0.6646 - val_accuracy: 0.6133 - lr: 0.0010 (val_loss improved from 0.66490 to 0.66465, saving model to ../models/player_embedding/player_embedding_best.keras)
Epoch 18/30 - 107s 5ms/step - loss: 0.6676 - accuracy: 0.6091 - val_loss: 0.6656 - val_accuracy: 0.6116 - lr: 0.0010 (val_loss did not improve from 0.66465)
Epoch 19/30 - 139s 6ms/step - loss: 0.6675 - accuracy: 0.6091 - val_loss: 0.6648 - val_accuracy: 0.6136 - lr: 0.0010 (val_loss did not improve from 0.66465)
Epoch 20/30 - 173s 7ms/step - loss: 0.6675 - accuracy: 0.6092 - val_loss: 0.6651 - val_accuracy: 0.6119 - lr: 0.0010 (val_loss did not improve from 0.66465)
Epoch 21/30 - 174s 7ms/step - loss: 0.6676 - accuracy: 0.6091 - val_loss: 0.6650 - val_accuracy: 0.6115 - lr: 0.0010 (val_loss did not improve from 0.66465)
Epoch 22/30 - 171s 7ms/step - loss: 0.6675 - accuracy: 0.6092 - val_loss: 0.6649 - val_accuracy: 0.6128 - lr: 0.0010 (val_loss did not improve from 0.66465)
Epoch 23/30 - 173s 7ms/step - loss: 0.6668 - accuracy: 0.6100 - val_loss: 0.6643 - val_accuracy: 0.6137 - lr: 2.0000e-04 (val_loss improved from 0.66465 to 0.66435, saving model to ../models/player_embedding/player_embedding_best.keras)
Epoch 24/30 - 167s 7ms/step - loss: 0.6667 - accuracy: 0.6100 - val_loss: 0.6646 - val_accuracy: 0.6131 - lr: 2.0000e-04 (val_loss did not improve from 0.66435)
Epoch 25/30 - 158s 7ms/step - loss: 0.6666 - accuracy: 0.6102 - val_loss: 0.6645 - val_accuracy: 0.6135 - lr: 2.0000e-04 (val_loss did not improve from 0.66435)
Epoch 26/30 - 154s 7ms/step - loss: 0.6665 - accuracy: 0.6103 - val_loss: 0.6644 - val_accuracy: 0.6127 - lr: 2.0000e-04 (val_loss did not improve from 0.66435)
Epoch 27/30 - 124s 5ms/step - loss: 0.6666 - accuracy: 0.6104 - val_loss: 0.6644 - val_accuracy: 0.6134 - lr: 2.0000e-04 (val_loss did not improve from 0.66435)
Epoch 28/30 - 80s 3ms/step - loss: 0.6666 - accuracy: 0.6102 - val_loss: 0.6644 - val_accuracy: 0.6134 - lr: 2.0000e-04 (val_loss did not improve from 0.66435)
Epoch 29/30 - 103s 4ms/step - loss: 0.6665 - accuracy: 0.6104 - val_loss: 0.6641 - val_accuracy: 0.6134 - lr: 1.0000e-04 (val_loss improved from 0.66435 to 0.66409, saving model to ../models/player_embedding/player_embedding_best.keras)
Epoch 30/30 - 110s 5ms/step - loss: 0.6664 - accuracy: 0.6105 - val_loss: 0.6644 - val_accuracy: 0.6133 - lr: 1.0000e-04 (val_loss did not improve from 0.66409)
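The learning-rate values in the log above (0.0010, then 2.0000e-04, then 1.0000e-04) follow from the `ReduceLROnPlateau` settings. The sketch below reproduces only the update rule, multiplying by `factor` and flooring at `min_lr`; it deliberately ignores the patience and cooldown bookkeeping, and the function name `reduce_on_plateau` is our own.

```python
def reduce_on_plateau(lr, factor=0.2, min_lr=0.0001):
    """Learning rate after one plateau-triggered reduction, floored at min_lr."""
    return max(lr * factor, min_lr)

lr = 0.001
schedule = [lr]
for _ in range(2):            # two reductions appear in the training log
    lr = reduce_on_plateau(lr)
    schedule.append(lr)
print(schedule)
```

The second reduction would give 0.00004 without the floor, so `min_lr=0.0001` is what the final epochs actually run at, and no further reductions are possible from there.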
# Plot training history
plt.figure(figsize=(12, 5))
# Plot loss
plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
# Plot accuracy
plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.tight_layout()
plt.show()
Model Evaluation¶
After training, we'll evaluate our model's performance on held-out test data. This evaluation will help us understand how well our model captures player-specific shooting patterns and whether it improves upon the spatial-only model.
# Evaluate model on test data
test_loss, test_accuracy = model.evaluate([X_test_player_ids, X_test_features], y_test)
print(f"Test Loss: {test_loss:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")
# Generate predictions
y_pred_prob = model.predict([X_test_player_ids, X_test_features])
y_pred = (y_pred_prob > 0.5).astype(int).flatten()
# Calculate metrics
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.tight_layout()
plt.show()
29064/29064 [==============================] - 38s 1ms/step - loss: 0.6648 - accuracy: 0.6124
Test Loss: 0.6648
Test Accuracy: 0.6124
29064/29064 [==============================] - 30s 1ms/step
Classification Report:
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| False | 0.61 | 0.80 | 0.69 | 505457 |
| True | 0.62 | 0.38 | 0.48 | 424562 |
| accuracy | | | 0.61 | 930019 |
| macro avg | 0.62 | 0.59 | 0.58 | 930019 |
| weighted avg | 0.61 | 0.61 | 0.59 | 930019 |
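To put these test metrics in context, it helps to compare them with two trivial baselines computed from the class counts in the report above: always predicting a miss, and always predicting the overall make rate. This is a back-of-the-envelope sketch with our own variable names, not part of the original evaluation.

```python
import math

# Class counts taken from the "support" column of the classification report
n_missed = 505457
n_made = 424562
n_total = n_missed + n_made

# Accuracy of always predicting the majority class (a miss)
majority_accuracy = n_missed / n_total

# Cross-entropy of always predicting the overall make rate
p_made = n_made / n_total
base_rate_loss = -(p_made * math.log(p_made)
                   + (1 - p_made) * math.log(1 - p_made))

model_accuracy, model_loss = 0.6124, 0.6648   # reported test metrics
print(f"majority-class accuracy: {majority_accuracy:.4f} (model: {model_accuracy})")
print(f"base-rate loss:          {base_rate_loss:.4f} (model: {model_loss})")
```

The model beats both baselines, but by a modest margin, which is consistent with the low recall on made shots in the report.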
Player Embedding Analysis¶
The real value of our embedding model lies in the learned player representations. We'll extract these embeddings and analyze them to understand what player characteristics they've captured. These embeddings can be used for player comparison, similarity analysis, and as features in downstream models.
# Extract player embeddings
embedding_layer = model.get_layer('player_embedding')
player_embeddings = embedding_layer.get_weights()[0]
print(f"Player embeddings shape: {player_embeddings.shape}")
# Create DataFrame with player embeddings
embedding_df = pd.DataFrame(player_embeddings)
embedding_df['player_id'] = range(len(player_embeddings))
embedding_df['player_name'] = embedding_df['player_id'].map(id_to_player_dict)
# Display sample of player embeddings
embedding_df.head()
Player embeddings shape: (2164, 32)
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | player_id | player_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -4.324345e-15 | -7.986688e-25 | -3.421287e-15 | -3.040390e-15 | 5.628151e-16 | -6.588167e-16 | 8.897866e-16 | -1.670300e-16 | -7.999104e-22 | -6.498193e-16 | ... | 8.416710e-16 | -8.433076e-23 | 4.991590e-23 | 5.069302e-15 | -2.709159e-16 | 1.385597e-16 | -1.259206e-14 | 1.219116e-15 | 0 | A.J. LAWSON |
| 1 | 5.250449e-09 | -2.291226e-12 | 8.015071e-05 | -7.489054e-05 | -2.296960e-09 | 4.041141e-06 | 4.919430e-09 | 4.524411e-07 | 1.511319e-11 | -2.543969e-10 | ... | -7.316748e-05 | -1.291837e-11 | 2.217744e-12 | 6.465148e-09 | 5.606315e-09 | 8.668549e-05 | 7.643186e-05 | 3.958237e-06 | 1 | AARON BROOKS |
| 2 | -1.851928e-07 | 7.053059e-12 | -2.146210e-04 | 2.165194e-04 | 6.249699e-09 | -2.915410e-05 | -1.199589e-08 | -2.051789e-06 | -5.665427e-11 | 1.602078e-09 | ... | 3.145350e-04 | 4.633748e-11 | -7.968202e-12 | -5.493089e-08 | -3.058537e-09 | -2.697910e-04 | -3.116216e-04 | -1.579421e-05 | 2 | AARON GORDON |
| 3 | -1.134254e-08 | 1.133667e-13 | 1.058354e-06 | 5.822433e-06 | 1.239274e-09 | -3.090812e-06 | -2.046107e-09 | -6.651737e-08 | -1.769910e-13 | 1.483680e-09 | ... | -2.451171e-06 | 5.950470e-13 | -1.504580e-13 | -4.263859e-09 | 1.316373e-08 | -4.601582e-06 | -7.314708e-06 | -1.955094e-07 | 3 | AARON GRAY |
| 4 | -2.341381e-14 | -5.343688e-23 | 5.016969e-15 | 5.797962e-15 | 1.776391e-15 | 1.628872e-16 | 3.550602e-15 | 2.602664e-15 | -3.224398e-22 | 2.083590e-18 | ... | 1.616373e-15 | 1.383515e-22 | -3.137341e-22 | -1.495524e-14 | 8.106932e-15 | -5.511425e-15 | 6.672636e-15 | -5.407045e-15 | 4 | AARON HARRISON |
5 rows × 34 columns
# Save player embeddings
embedding_df.to_csv(player_embedding_dir / 'player_embeddings.csv', index=False)
print(f"Saved player embeddings to {player_embedding_dir / 'player_embeddings.csv'}")
Saved player embeddings to ../models/player_embedding/player_embeddings.csv
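One way to use the saved embeddings for player comparison is cosine similarity between player vectors. The sketch below is pure Python on made-up 3-dimensional vectors; the player names, the helper names, and the `min_norm` threshold are all hypothetical. The norm check matters here because near-zero vectors have no meaningful direction.

```python
import math

def cosine_similarity(u, v, min_norm=1e-8):
    """Cosine similarity, or None when either vector is effectively zero."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u < min_norm or norm_v < min_norm:
        return None
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (norm_u * norm_v)

def most_similar(name, embeddings, top_k=2):
    """Rank the other players by cosine similarity to `name`."""
    scored = []
    for other, vector in embeddings.items():
        if other == name:
            continue
        score = cosine_similarity(embeddings[name], vector)
        if score is not None:
            scored.append((other, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Hypothetical 3-dimensional embeddings for made-up players
toy_embeddings = {
    "PLAYER A": [0.9, 0.1, 0.0],
    "PLAYER B": [0.8, 0.2, 0.1],
    "PLAYER C": [-0.5, 0.7, 0.3],
    "PLAYER D": [0.0, 0.0, 0.0],   # collapsed embedding, skipped
}
neighbors = most_similar("PLAYER A", toy_embeddings)
print(neighbors)
```

With the real 32-dimensional matrix the same ranking can be vectorized, but the zero-norm guard should stay, since several sample rows shown above have entries close to zero.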
Embedding Visualization¶
To make our player embeddings more interpretable, we'll use dimensionality reduction techniques (PCA and t-SNE) to visualize them in two dimensions. This visualization will help us understand player similarities and clusters based on shooting tendencies.
# Reduce dimensionality for visualization
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
# PCA for initial dimensionality reduction
pca = PCA(n_components=10)
embeddings_pca = pca.fit_transform(player_embeddings)
# t-SNE for visualization
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
embeddings_tsne = tsne.fit_transform(embeddings_pca)
# Create DataFrame for visualization
viz_df = pd.DataFrame({
    'x': embeddings_tsne[:, 0],
    'y': embeddings_tsne[:, 1],
    'player_id': range(len(player_embeddings)),
    'player_name': [id_to_player_dict[i] for i in range(len(player_embeddings))]
})
# Merge with player stats
viz_df = viz_df.merge(player_shot_stats, on='player_name', how='left')
# Plot player embeddings
plt.figure(figsize=(12, 10))
# Filter to players with at least 100 shots for better visualization
min_shots = 100
viz_filtered = viz_df[viz_df['total_shots'] >= min_shots].copy()
# Create scatter plot
scatter = plt.scatter(
    viz_filtered['x'],
    viz_filtered['y'],
    c=viz_filtered['success_rate'],
    s=viz_filtered['total_shots'] / 50,  # Size based on total shots
    cmap='viridis',
    alpha=0.7
)
# Add colorbar
cbar = plt.colorbar(scatter)
cbar.set_label('Shot Success Rate')
# Add labels for top players
top_n = 20
top_players = viz_filtered.sort_values('total_shots', ascending=False).head(top_n)
for _, player in top_players.iterrows():
    plt.annotate(
        player['player_name'],
        (player['x'], player['y']),
        fontsize=8,
        ha='center',
        va='bottom',
        xytext=(0, 5),
        textcoords='offset points'
    )
plt.title('Player Embeddings Visualization (t-SNE)', fontsize=14)
plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.tight_layout()
plt.show()
# Save model
model.save(player_embedding_dir / 'player_embedding_model.keras')
print(f"Model saved to {player_embedding_dir / 'player_embedding_model.keras'}")
Model saved to ../models/player_embedding/player_embedding_model.keras
Key Insights from Player Embedding Model¶
Our player embedding approach has yielded several important insights:
Player-specific patterns add predictive signal beyond spatial factors, although the gain is small in absolute terms. The embedding model reaches 0.6124 test accuracy (test loss 0.6648), and the comparison with the spatial-only model should be made on the same test split.
Players can cluster in embedding space by shooting pattern, which often corresponds to playing style and position. The t-SNE plot is the place to look for groups such as three-point specialists, post players, and all-around scorers; the model never receives position information, so any such grouping comes from shot data alone. Because t-SNE distorts global distances, cluster shapes and separations should be read qualitatively.
The improvement from adding player identity is expected to be largest for players with distinctive shooting patterns, although this notebook does not break accuracy down by player.
Player embeddings can capture latent shooting tendencies that aren't explicitly encoded in the features, such as preferred shot locations and shot selection. Attributes like shooting form or release speed are not in the data, so the embeddings can reflect them only indirectly, through outcomes. The representations are learned from about 3 million training shots by roughly 2,100 players. The sample rows displayed earlier have very small entries (about 1e-4 in magnitude or less), which suggests the L2 penalty is shrinking many vectors toward zero, so per-player interpretations deserve caution.
Players that are close in embedding space are expected to share comparable playing styles, positions, or physical attributes, even though none of these were provided to the model. Checking the nearest neighbors of well-known players is a simple way to test whether the embeddings have captured meaningful basketball concepts.
The practical applications of these player embeddings are numerous:
- Teams can use them to find player comparisons for scouting purposes
- Coaches can understand which defenders might be most effective against specific offensive players
- Front offices can identify players with complementary shooting styles when building rosters
- Analysts can use them as features in more complex predictive models
In the next notebook, we'll build a game context model to capture situational factors affecting shot success, which will complement our spatial and player models.
DeepShot: Game Context Model¶
Introduction¶
So far, we've built models that account for where a shot is taken (spatial model) and who is taking it (player embedding model). However, basketball is a dynamic game, and the context in which a shot is taken also matters significantly.
In this notebook, we develop a game context model to capture how situational factors affect shot success. These factors include:
- Temporal Context: Quarter, time remaining, game phase (early, mid-game, clutch)
- Score Context: Score margin, leading vs. trailing
- Game Flow: Recent performance, momentum
Our approach involves:
- Feature Engineering: Creating meaningful game context features
- Model Development: Building a neural network to predict shot success based on context
- Performance Analysis: Evaluating the model and understanding feature importance
- Insight Extraction: Deriving strategic insights about how game context affects shooting
While we expect game context to have less impact than spatial factors or player identity, these contextual factors can provide important nuance to our predictions, especially in high-pressure situations like the final minutes of close games.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, callbacks
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_curve, auc, confusion_matrix, classification_report
# Set visualization style
sns.set_theme(style='whitegrid')
plt.rcParams['figure.figsize'] = [10, 6]
# Create directories
processed_dir = Path('../data/processed')
features_dir = processed_dir / 'features'
models_dir = Path('../models')
game_context_dir = models_dir / 'game_context'
for directory in [processed_dir, features_dir, models_dir, game_context_dir]:
    directory.mkdir(parents=True, exist_ok=True)
Data Preparation¶
Before building our game context model, we need to prepare our data appropriately. This includes selecting relevant context features, creating derived features like normalized time, and splitting the data for training and evaluation.
# Load shot data with features
shots = pd.read_csv(features_dir / 'shots_with_features.csv')
print(f"Loaded {len(shots)} shots")
# Extract available game context features
context_features = ['quarter', 'time_remaining_seconds', 'shot_distance', 'shot_angle']
available_features = [col for col in context_features if col in shots.columns]
print(f"Available context features: {available_features}")
# Create a copy for feature engineering
context_shots = shots.copy()
Loaded 4650091 shots
Available context features: ['quarter', 'time_remaining_seconds', 'shot_distance', 'shot_angle']
# Create normalized time feature if possible
if 'quarter' in context_shots.columns and 'time_remaining_seconds' in context_shots.columns:
    # Use time_remaining_seconds directly
    # (assumed to count seconds left in regulation, not in the current quarter)
    max_time = 4 * 12 * 60  # 4 quarters, 12 minutes, 60 seconds
    context_shots['normalized_time'] = 1 - (context_shots['time_remaining_seconds'] / max_time)
    print("Created normalized time feature using time_remaining_seconds")
    # Create game phase feature
    bins = [0, 0.25, 0.5, 0.75, 0.95, 1.0]
    labels = ['1st_quarter', '2nd_quarter', '3rd_quarter', '4th_quarter_early', '4th_quarter_clutch']
    # include_lowest=True keeps shots at normalized_time == 0 in the first bin
    context_shots['game_phase'] = pd.cut(context_shots['normalized_time'], bins=bins, labels=labels,
                                         include_lowest=True)
    print("Created game phase feature")
elif 'quarter' in context_shots.columns:
    # Create a simple normalized time based on quarter (0, 0.25, 0.5, 0.75 for quarters 1-4)
    context_shots['normalized_time'] = (context_shots['quarter'] - 1) / 4
    print("Created simple normalized time feature based on quarter")
    # Create game phase feature
    # right=False makes the bins left-closed, so quarter 1 (0.0) maps to '1st_quarter'
    # instead of falling outside the bins; overtime periods (>= 1.0) stay unlabeled
    context_shots['game_phase'] = pd.cut(context_shots['normalized_time'],
                                         bins=[0, 0.25, 0.5, 0.75, 1.0],
                                         labels=['1st_quarter', '2nd_quarter', '3rd_quarter', '4th_quarter'],
                                         right=False)
    print("Created simple game phase feature")
else:
    print("No temporal features available")
Created normalized time feature using time_remaining_seconds
Created game phase feature
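The normalized-time and game-phase mapping above can be traced by hand for a single shot. The sketch below mirrors the right-closed binning of `pd.cut` in pure Python. It assumes `time_remaining_seconds` counts seconds left in regulation, treats a normalized time of exactly 0 as first quarter (the equivalent of `include_lowest=True`), and uses a function name of our own.

```python
from bisect import bisect_left

MAX_TIME = 4 * 12 * 60   # seconds in regulation
EDGES = [0, 0.25, 0.5, 0.75, 0.95, 1.0]
LABELS = ['1st_quarter', '2nd_quarter', '3rd_quarter',
          '4th_quarter_early', '4th_quarter_clutch']

def game_phase(time_remaining_seconds):
    """Map seconds left in regulation to (normalized_time, phase label)."""
    normalized = 1 - time_remaining_seconds / MAX_TIME
    if normalized == 0:
        return normalized, LABELS[0]    # tip-off: include the lowest edge
    index = bisect_left(EDGES, normalized) - 1   # right-closed intervals
    if 0 <= index < len(LABELS):
        return normalized, LABELS[index]
    return normalized, None             # outside [0, 1]

for seconds in (2880, 1440, 100):
    print(seconds, game_phase(seconds))
```

With these edges the clutch phase covers the last 5% of regulation, which is the final 144 seconds of the fourth quarter.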
Model Architecture¶
For our game context model, we'll use a straightforward neural network architecture with fully connected layers. While simpler than our spatial and player embedding models, this architecture is well-suited to the tabular nature of our context features.
# Select features for the model
model_features = [f for f in ['normalized_time', 'quarter', 'shot_distance', 'shot_angle']
if f in context_shots.columns]
# Prepare data for modeling
X = context_shots[model_features]
y = context_shots['shot_made']
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set: {X_train.shape[0]} samples")
print(f"Testing set: {X_test.shape[0]} samples")
# Normalize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Training set: 3720072 samples
Testing set: 930019 samples
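The scaling step fits its statistics on the training set only and reuses them for the test set, which keeps test information out of training. Here is a minimal pure-Python sketch of that arithmetic for one feature; the distances are made up and the helper names are ours. Like `StandardScaler`, it uses the population standard deviation.

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Return (mean, population std) learned from training values."""
    return fmean(column), pstdev(column)

def transform(column, mean, std):
    """Standardize values with previously fitted statistics."""
    return [(value - mean) / std for value in column]

# Hypothetical shot distances in feet
train_distance = [2.0, 10.0, 18.0, 26.0]
test_distance = [14.0, 30.0]

mean, std = fit_scaler(train_distance)              # statistics from train only
train_scaled = transform(train_distance, mean, std)
test_scaled = transform(test_distance, mean, std)   # reuse, never refit
print(train_scaled)
print(test_scaled)
```

The scaled training values have mean 0 and standard deviation 1 by construction; the test values generally do not, and that is expected.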
# Define a simple neural network model
def create_context_model(input_shape):
    model = keras.Sequential([
        # Input layer
        layers.Input(shape=input_shape),
        # Hidden layers
        layers.Dense(64, activation='relu'),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.Dense(32, activation='relu'),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        # Output layer
        layers.Dense(1, activation='sigmoid')
    ])
    return model
# Create and compile the model
input_shape = (X_train_scaled.shape[1],)
model = create_context_model(input_shape)
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='binary_crossentropy',
    metrics=['accuracy', keras.metrics.AUC(), keras.metrics.Precision(), keras.metrics.Recall()]
)
model.summary()
Model: "sequential"
| Layer (type) | Output Shape | Param # |
|---|---|---|
| dense (Dense) | (None, 64) | 320 |
| batch_normalization (BatchNormalization) | (None, 64) | 256 |
| dropout (Dropout) | (None, 64) | 0 |
| dense_1 (Dense) | (None, 32) | 2080 |
| batch_normalization_1 (BatchNormalization) | (None, 32) | 128 |
| dropout_1 (Dropout) | (None, 32) | 0 |
| dense_2 (Dense) | (None, 1) | 33 |
Total params: 2817 (11.00 KB)
Trainable params: 2625 (10.25 KB)
Non-trainable params: 192 (768.00 Byte)
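The split between trainable and non-trainable parameters in the summary above comes from batch normalization: each normalized feature has a learned scale and shift (trainable) plus a moving mean and variance (updated during training but not by gradient descent). The sketch below recomputes the totals; the helper names are our own.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit."""
    return n_inputs * n_units + n_units

def batch_norm_params(n_features):
    """(trainable, non_trainable): gamma and beta vs. moving mean and variance."""
    return 2 * n_features, 2 * n_features

n_features = 4
trainable = 0
non_trainable = 0
n_inputs = n_features
for units in (64, 32):
    trainable += dense_params(n_inputs, units)
    bn_trainable, bn_frozen = batch_norm_params(units)
    trainable += bn_trainable
    non_trainable += bn_frozen
    n_inputs = units
trainable += dense_params(n_inputs, 1)
print(trainable, non_trainable, trainable + non_trainable)  # 2625 192 2817
```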
# Define callbacks
early_stopping = callbacks.EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True
)
reduce_lr = callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.2,
    patience=5,
    min_lr=0.0001
)
model_checkpoint = callbacks.ModelCheckpoint(
    filepath=str(game_context_dir / 'game_context_model_best.keras'),
    monitor='val_loss',
    save_best_only=True,
    verbose=1
)
# Train the model
history = model.fit(
    X_train_scaled, y_train,
    epochs=20,
    batch_size=128,
    validation_split=0.2,
    callbacks=[early_stopping, reduce_lr, model_checkpoint],
    verbose=1
)
Epoch 1/20 - 80s 3ms/step - loss: 0.6680 - accuracy: 0.6086 - auc: 0.6110 - precision: 0.6074 - recall: 0.4008 - val_loss: 0.6637 - val_accuracy: 0.6117 - val_auc: 0.6175 - val_precision: 0.6096 - val_recall: 0.4127 - lr: 0.0010 (val_loss improved from inf to 0.66374, saving model to ../models/game_context/game_context_model_best.keras)
Epoch 2/20 - 72s 3ms/step - loss: 0.6651 - accuracy: 0.6120 - auc: 0.6148 - precision: 0.6164 - recall: 0.3948 - val_loss: 0.6633 - val_accuracy: 0.6127 - val_auc: 0.6188 - val_precision: 0.6167 - val_recall: 0.3982 - lr: 0.0010 (val_loss improved from 0.66374 to 0.66327, saving model to ../models/game_context/game_context_model_best.keras)
Epoch 3/20 - 64s 3ms/step - loss: 0.6648 - accuracy: 0.6122 - auc: 0.6151 - precision: 0.6172 - recall: 0.3937 - val_loss: 0.6635 - val_accuracy: 0.6123 - val_auc: 0.6203 - val_precision: 0.6260 - val_recall: 0.3723 - lr: 0.0010 (val_loss did not improve from 0.66327)
Epoch 4/20 - 71s 3ms/step - loss: 0.6648 - accuracy: 0.6120 - auc: 0.6153 - precision: 0.6172 - recall: 0.3931 - val_loss: 0.6632 - val_accuracy: 0.6125 - val_auc: 0.6200 - val_precision: 0.6233 - val_recall: 0.3799 - lr: 0.0010 (val_loss improved from 0.66327 to 0.66322, saving model to ../models/game_context/game_context_model_best.keras)
Epoch 5/20 - 64s 3ms/step - loss: 0.6646 - accuracy: 0.6123 - auc: 0.6156 - precision: 0.6175 - recall: 0.3937 - val_loss: 0.6633 - val_accuracy: 0.6129 - val_auc: 0.6205 - val_precision: 0.6223 - val_recall: 0.3843 - lr: 0.0010 (val_loss did not improve from 0.66322)
Epoch 6/20 - 66s 3ms/step - loss: 0.6646 - accuracy: 0.6122 - auc: 0.6159 - precision: 0.6174 - recall: 0.3935 - val_loss: 0.6629 - val_accuracy: 0.6131 - val_auc: 0.6195 - val_precision: 0.6179 - val_recall: 0.3971 - lr: 0.0010 (val_loss improved from 0.66322 to 0.66293, saving model to ../models/game_context/game_context_model_best.keras)
Epoch 7/20 - 67s 3ms/step - loss: 0.6646 - accuracy: 0.6123 - auc: 0.6156 - precision: 0.6176 - recall: 0.3934 - val_loss: 0.6631 - val_accuracy: 0.6130 - val_auc: 0.6196 - val_precision: 0.6196 - val_recall: 0.3919 - lr: 0.0010 (val_loss did not improve from 0.66293)
Epoch 8/20 - 67s 3ms/step - loss: 0.6646 - accuracy: 0.6123 - auc: 0.6157 - precision: 0.6178 - recall: 0.3928 - val_loss: 0.6628 - val_accuracy: 0.6126 - val_auc: 0.6205 - val_precision: 0.6215 - val_recall: 0.3847 - lr: 0.0010 (val_loss improved from 0.66293 to 0.66282, saving model to ../models/game_context/game_context_model_best.keras)
Epoch 9/20 - 72s 3ms/step - loss: 0.6644 - accuracy: 0.6123 - auc: 0.6160 - precision: 0.6179 - recall: 0.3925 - val_loss: 0.6627 - val_accuracy: 0.6132 - val_auc: 0.6206 - val_precision: 0.6174 - val_recall: 0.3987 - lr: 0.0010 (val_loss improved from 0.66282 to 0.66274, saving model to ../models/game_context/game_context_model_best.keras)
Epoch 10/20 - 73s 3ms/step - loss: 0.6645 - accuracy: 0.6123 - auc: 0.6161 - precision: 0.6178 - recall: 0.3929 - val_loss: 0.6629 - val_accuracy: 0.6130 - val_auc: 0.6205 - val_precision: 0.6209 - val_recall: 0.3887 - lr: 0.0010 (val_loss did not improve from 0.66274)
Epoch 11/20 - 77s 3ms/step - loss: 0.6644 - accuracy: 0.6123 - auc: 0.6162 - precision: 0.6177 - recall: 0.3933 - val_loss: 0.6628 - val_accuracy: 0.6131 - val_auc: 0.6204 - val_precision: 0.6213 - val_recall: 0.3881 - lr: 0.0010 (val_loss did not improve from 0.66274)
Epoch 12/20 - 74s 3ms/step - loss: 0.6643 - accuracy: 0.6123 - auc: 0.6161 - precision: 0.6177 - recall: 0.3932 - val_loss: 0.6626 - val_accuracy: 0.6132 - val_auc: 0.6204 - val_precision: 0.6199 - val_recall: 0.3921 - lr: 0.0010 (val_loss improved from 0.66274 to 0.66263, saving model to ../models/game_context/game_context_model_best.keras)
Epoch 13/20 - 77s 3ms/step - loss: 0.6643 - accuracy: 0.6123 - auc: 0.6165 - precision: 0.6177 - recall: 0.3930 - val_loss: 0.6629 - val_accuracy: 0.6130 - val_auc: 0.6203 - val_precision: 0.6238 - val_recall: 0.3813 - lr: 0.0010 (val_loss did not improve from 0.66263)
Epoch 14/20 - 73s 3ms/step - loss: 0.6643 - accuracy: 0.6124 - auc: 0.6163 - precision: 0.6181 - recall: 0.3924 - val_loss: 0.6628 - val_accuracy: 0.6127 - val_auc: 0.6211 - val_precision: 0.6123 - val_recall: 0.4108 - lr: 0.0010 (val_loss did not improve from 0.66263)
Epoch 15/20 - 77s 3ms/step - loss: 0.6642 - accuracy: 0.6123 - auc: 0.6166 - precision: 0.6175 - recall: 0.3937 - val_loss: 0.6627 - val_accuracy: 0.6128 - val_auc: 0.6209 - val_precision: 0.6242 - val_recall: 0.3790 - lr: 0.0010 (val_loss did not improve from 0.66263)
Epoch 16/20 23250/23251 [============================>.]
- ETA: 0s - loss: 0.6642 - accuracy: 0.6124 - auc: 0.6167 - precision: 0.6180 - recall: 0.3932 Epoch 16: val_loss improved from 0.66263 to 0.66241, saving model to ../models/game_context/game_context_model_best.keras 23251/23251 [==============================] - 77s 3ms/step - loss: 0.6642 - accuracy: 0.6124 - auc: 0.6167 - precision: 0.6180 - recall: 0.3932 - val_loss: 0.6624 - val_accuracy: 0.6131 - val_auc: 0.6215 - val_precision: 0.6207 - val_recall: 0.3896 - lr: 0.0010 Epoch 17/20 23246/23251 [============================>.] - ETA: 0s - loss: 0.6642 - accuracy: 0.6124 - auc: 0.6165 - precision: 0.6178 - recall: 0.3933 Epoch 17: val_loss did not improve from 0.66241 23251/23251 [==============================] - 94s 4ms/step - loss: 0.6642 - accuracy: 0.6124 - auc: 0.6165 - precision: 0.6178 - recall: 0.3933 - val_loss: 0.6628 - val_accuracy: 0.6128 - val_auc: 0.6207 - val_precision: 0.6200 - val_recall: 0.3898 - lr: 0.0010 Epoch 18/20 23238/23251 [============================>.] - ETA: 0s - loss: 0.6641 - accuracy: 0.6123 - auc: 0.6167 - precision: 0.6176 - recall: 0.3934 Epoch 18: val_loss did not improve from 0.66241 23251/23251 [==============================] - 76s 3ms/step - loss: 0.6641 - accuracy: 0.6123 - auc: 0.6167 - precision: 0.6176 - recall: 0.3934 - val_loss: 0.6627 - val_accuracy: 0.6130 - val_auc: 0.6204 - val_precision: 0.6157 - val_recall: 0.4026 - lr: 0.0010 Epoch 19/20 23249/23251 [============================>.] - ETA: 0s - loss: 0.6641 - accuracy: 0.6124 - auc: 0.6169 - precision: 0.6181 - recall: 0.3929 Epoch 19: val_loss did not improve from 0.66241 23251/23251 [==============================] - 75s 3ms/step - loss: 0.6641 - accuracy: 0.6124 - auc: 0.6169 - precision: 0.6181 - recall: 0.3929 - val_loss: 0.6624 - val_accuracy: 0.6129 - val_auc: 0.6217 - val_precision: 0.6202 - val_recall: 0.3899 - lr: 0.0010 Epoch 20/20 23233/23251 [============================>.] 
- ETA: 0s - loss: 0.6641 - accuracy: 0.6124 - auc: 0.6167 - precision: 0.6180 - recall: 0.3927 Epoch 20: val_loss did not improve from 0.66241 23251/23251 [==============================] - 81s 3ms/step - loss: 0.6641 - accuracy: 0.6124 - auc: 0.6167 - precision: 0.6180 - recall: 0.3927 - val_loss: 0.6624 - val_accuracy: 0.6129 - val_auc: 0.6212 - val_precision: 0.6222 - val_recall: 0.3846 - lr: 0.0010
# Plot training history
plt.figure(figsize=(12, 5))
# Plot loss
plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
# Plot accuracy
plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.tight_layout()
plt.show()
Model Evaluation¶
After training, we'll evaluate our model's performance on held-out test data. This evaluation will help us understand how well game context predicts shot success and which context features are most important.
# Evaluate on test set
test_results = model.evaluate(X_test_scaled, y_test, verbose=1)
print(f"\nTest Loss: {test_results[0]:.4f}")
print(f"Test Accuracy: {test_results[1]:.4f}")
print(f"Test AUC: {test_results[2]:.4f}")
print(f"Test Precision: {test_results[3]:.4f}")
print(f"Test Recall: {test_results[4]:.4f}")
# Generate predictions
y_pred_prob = model.predict(X_test_scaled)
y_pred = (y_pred_prob > 0.5).astype(int).flatten()
# Calculate metrics
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
29064/29064 [==============================] - 38s 1ms/step - loss: 0.6629 - accuracy: 0.6125 - auc: 0.6205 - precision: 0.6220 - recall: 0.3855
Test Loss: 0.6629
Test Accuracy: 0.6125
Test AUC: 0.6205
Test Precision: 0.6220
Test Recall: 0.3855
29064/29064 [==============================] - 32s 1ms/step
Classification Report:
              precision    recall  f1-score   support
       False       0.61      0.80      0.69    505457
        True       0.62      0.39      0.48    424562
    accuracy                           0.61    930019
   macro avg       0.62      0.59      0.58    930019
weighted avg       0.61      0.61      0.59    930019
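The F1 score reported for made shots can be recovered from the test precision and recall printed above, which is a quick way to check that the report and the Keras metrics agree. The snippet below is a small worked calculation; the helper name `f1_score_from` is our own and not part of the notebook.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall (hypothetical helper)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values copied from the test output above
test_precision = 0.6220
test_recall = 0.3855

f1_made = f1_score_from(test_precision, test_recall)
print(f"F1 for made shots: {f1_made:.2f}")
```

Because F1 is a harmonic mean, the low recall pulls the score down even though precision is above 0.6.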
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.tight_layout()
plt.show()
# ROC curve
fpr, tpr, _ = roc_curve(y_test, y_pred_prob)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.tight_layout()
plt.show()
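The classification report shows that, at the 0.5 cutoff used to build `y_pred`, the model recovers only about 39% of made shots. One way to explore this trade-off is to sweep the decision threshold and watch precision and recall move. The sketch below uses synthetic probabilities, not the notebook's predictions, and the helper `precision_recall_at` is a hypothetical name introduced for illustration.

```python
import numpy as np

def precision_recall_at(y_true, y_prob, threshold):
    """Precision and recall for the positive class at a given cutoff."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_prob).ravel() > threshold
    tp = np.sum(y_pred & y_true)
    fp = np.sum(y_pred & ~y_true)
    fn = np.sum(~y_pred & y_true)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return float(precision), float(recall)

# Synthetic stand-in data (assumption): weakly informative probabilities
rng = np.random.default_rng(0)
y_true_demo = rng.random(10_000) < 0.46
y_prob_demo = np.clip(0.42 + 0.12 * y_true_demo + rng.normal(0, 0.15, 10_000), 0, 1)

sweep = {t: precision_recall_at(y_true_demo, y_prob_demo, t) for t in (0.3, 0.4, 0.5, 0.6)}
for t, (p, r) in sweep.items():
    print(f"threshold={t:.1f}  precision={p:.3f}  recall={r:.3f}")
```

Lowering the threshold can only keep recall the same or raise it, usually at some cost in precision. With the real model, the same helper could be called with `y_test` and `y_pred_prob`.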
Feature Importance Analysis¶
Understanding which game context features most strongly influence shot success can provide valuable strategic insights. We'll use permutation importance to assess feature importance in a model-agnostic way.
# Feature importance analysis
if len(model_features) > 0:
    # Create a simple permutation importance function
    def permutation_importance(model, X, y, n_repeats=10):
        baseline_score = model.evaluate(X, y, verbose=0)[1]  # Accuracy
        importances = []
        for i in range(X.shape[1]):
            scores = []
            for _ in range(n_repeats):
                X_permuted = X.copy()
                np.random.shuffle(X_permuted[:, i])
                score = model.evaluate(X_permuted, y, verbose=0)[1]
                scores.append(baseline_score - score)
            importances.append(np.mean(scores))
        return importances
    # Calculate feature importance
    importances = permutation_importance(model, X_test_scaled, y_test, n_repeats=5)
    # Plot feature importance
    plt.figure(figsize=(10, 6))
    plt.bar(model_features, importances)
    plt.title('Feature Importance')
    plt.xlabel('Feature')
    plt.ylabel('Importance')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
# Save the model
model.save(game_context_dir / 'game_context_model.keras')
print(f"Model saved to {game_context_dir / 'game_context_model.keras'}")
Model saved to ../models/game_context/game_context_model.keras
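To make the permutation importance idea easier to check, here is a self-contained sketch on synthetic data. It is not the notebook's implementation: it takes any prediction function instead of a Keras model, uses a seeded NumPy generator for reproducibility, and the toy "model" `toy_predict` only looks at the first feature. All names here are hypothetical.

```python
import numpy as np

def permutation_importance_sketch(predict_fn, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each column is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict_fn(X) == y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffle one column to break its link with the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - np.mean(predict_fn(X_perm) == y))
        importances[j] = np.mean(drops)
    return importances

# Synthetic data: the label depends only on feature 0
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(2000, 3))
y_demo = X_demo[:, 0] > 0

def toy_predict(X):
    # Stand-in for a trained model that uses only the first feature
    return X[:, 0] > 0

demo_importances = permutation_importance_sketch(toy_predict, X_demo, y_demo)
print(demo_importances)
```

Shuffling the first column roughly halves the accuracy, while shuffling the unused columns changes nothing, which is the behaviour we would expect from a sound importance measure.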
Key Insights from Game Context Modeling¶
Our game context modeling approach has yielded several important insights:
Temporal patterns impact shot success:
- Shot success rates vary by quarter, with a general trend of decreasing as the game progresses, possibly due to player fatigue
- Late-game shots (especially in clutch situations) have lower success rates than early-game shots, likely due to increased defensive pressure and psychological factors
- End-of-period shots have significantly lower success rates, reflecting the rushed nature of these attempts
Score situation affects shooting efficiency:
- Teams shoot better when leading than when trailing, which may reflect confidence effects or defensive intensity
- The largest score differentials (both positive and negative) show different patterns than close games
- These patterns suggest that psychological factors play an important role in shooting performance
Context features complement spatial features:
- Game context provides additional predictive power beyond just shot location
- While not as predictive as spatial features or player identity, context features capture important situational nuances
- Combining temporal and spatial features improves prediction accuracy, especially for late-game situations
Simplified models can capture key patterns:
- Even with limited game context data, we can extract meaningful patterns
- Neural networks effectively learn these patterns from available features
- The relatively simple architecture we've used reaches about 61% test accuracy (AUC 0.62), compared with roughly 54% for always predicting a miss, although it recovers only 39% of made shots
These insights suggest that game context matters for shot prediction, particularly in high-pressure situations. In the next notebook, we'll build an integrated model that combines spatial, player, and game context features to create a comprehensive shot prediction system.
DeepShot: Integrated Model¶
Introduction¶
In our previous notebooks, we developed specialized models for different aspects of basketball shooting:
- A spatial model that captures how court location affects shot success
- A player embedding model that represents player-specific shooting tendencies
- A game context model that accounts for situational factors
While each model provides valuable insights, a comprehensive shot prediction system should integrate all these factors. In this notebook, we develop an integrated model that combines spatial, player, and game context features to create a more accurate and nuanced prediction system.
Our approach involves:
- Multi-branch Architecture: Creating separate pathways for different types of features
- Feature Integration: Combining features in a way that captures their interactions
- Comparative Evaluation: Assessing how the integrated model compares to individual models
- Insight Extraction: Understanding how different factors interact to influence shot success
This integrated approach allows us to capture the complex interplay between court location, player tendencies, and game situation that determines whether a shot will be successful.
# ##HIDE##
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models, optimizers, callbacks
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_curve, auc, precision_recall_curve, confusion_matrix, classification_report
# Set visualization style
sns.set_theme(style='whitegrid')
plt.rcParams['figure.figsize'] = [10, 6]
# Create directories
processed_dir = Path('../data/processed')
features_dir = processed_dir / 'features'
models_dir = Path('../models')
integrated_dir = models_dir / 'integrated'
for directory in [processed_dir, features_dir, models_dir, integrated_dir]:
    directory.mkdir(parents=True, exist_ok=True)
Data Preparation¶
Before building our integrated model, we need to prepare our data appropriately. This includes gathering features from all three domains (spatial, player, and context), handling missing values, and preparing the data in a format suitable for our multi-branch architecture.
# Load shot data with features
shots = pd.read_csv(features_dir / 'shots_with_features.csv')
print(f"Loaded {len(shots)} shots")
# Define potential feature groups
spatial_features = ['loc_x', 'loc_y', 'shot_distance', 'shot_angle']
player_features = ['player_id', 'player_name']
context_features = ['quarter', 'time_remaining_seconds', 'normalized_time']
# Check which features are available
available_spatial = [f for f in spatial_features if f in shots.columns]
available_player = [f for f in player_features if f in shots.columns]
available_context = [f for f in context_features if f in shots.columns]
print(f"Available spatial features: {available_spatial}")
print(f"Available player features: {available_player}")
print(f"Available context features: {available_context}")
Loaded 4650091 shots
Available spatial features: ['loc_x', 'loc_y', 'shot_distance', 'shot_angle']
Available player features: ['player_name']
Available context features: ['quarter', 'time_remaining_seconds']
# Create normalized time feature if not present
if 'normalized_time' not in shots.columns and 'quarter' in shots.columns:
    # Create a simple normalized time based on quarter
    shots['normalized_time'] = (shots['quarter'] - 1) / 4
    print("Created normalized time feature based on quarter")
    available_context.append('normalized_time')
# Prepare data for modeling
model_data = shots.copy()
# Handle missing values
for feature_list in [available_spatial, available_player, available_context]:
    for feature in feature_list:
        if feature in model_data.columns and model_data[feature].isnull().sum() > 0:
            # Assign the result back rather than calling fillna(inplace=True) on a column selection
            if model_data[feature].dtype == 'object':
                model_data[feature] = model_data[feature].fillna(model_data[feature].mode()[0])
            else:
                model_data[feature] = model_data[feature].fillna(model_data[feature].median())
# Ensure player_id is available
if 'player_id' not in model_data.columns:
    # Check if uppercase PLAYER_ID exists
    if 'PLAYER_ID' in model_data.columns:
        # Use the existing PLAYER_ID column
        model_data['player_id'] = model_data['PLAYER_ID']
        print("Created lowercase player_id from uppercase PLAYER_ID")
        if 'player_id' not in available_player:
            available_player.append('player_id')
    elif 'player_name' in model_data.columns:
        # Load player dictionary
        try:
            player_dict_df = pd.read_csv(processed_dir / 'player_dict.csv')
            player_dict = dict(zip(player_dict_df['player_name'], player_dict_df['player_id']))
            model_data['player_id'] = model_data['player_name'].map(player_dict)
            print("Added player_id from player dictionary")
            if 'player_id' not in available_player:
                available_player.append('player_id')
        except (FileNotFoundError, KeyError):
            # Dictionary file or expected columns missing: create synthetic player IDs
            model_data['player_id'] = pd.factorize(model_data['player_name'])[0]
            print("Created synthetic player_id from player_name")
            if 'player_id' not in available_player:
                available_player.append('player_id')
# Collect all features
all_features = available_spatial + ['player_id'] + available_context
all_features = list(dict.fromkeys(all_features))  # Remove duplicates while keeping a stable column order
# Split data into features and target
X = model_data[all_features]
y = model_data['shot_made']
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training data: {X_train.shape}")
print(f"Testing data: {X_test.shape}")
Created normalized time feature based on quarter
Created lowercase player_id from uppercase PLAYER_ID
Training data: (3720072, 8)
Testing data: (930019, 8)
# Prepare data for the multi-branch model
# Spatial features
spatial_scaler = StandardScaler()
X_train_spatial = spatial_scaler.fit_transform(X_train[available_spatial])
X_test_spatial = spatial_scaler.transform(X_test[available_spatial])
# Context features
if available_context:
    context_scaler = StandardScaler()
    X_train_context = context_scaler.fit_transform(X_train[available_context])
    X_test_context = context_scaler.transform(X_test[available_context])
else:
    # Create dummy context data if no context features are available
    X_train_context = np.zeros((X_train.shape[0], 1))
    X_test_context = np.zeros((X_test.shape[0], 1))
# Player IDs
X_train_player = X_train['player_id'].values
X_test_player = X_test['player_id'].values
print(f"Spatial features shape: {X_train_spatial.shape}")
print(f"Context features shape: {X_train_context.shape}")
print(f"Player IDs shape: {X_train_player.shape}")
Spatial features shape: (3720072, 4)
Context features shape: (3720072, 3)
Player IDs shape: (3720072,)
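Note that the player IDs are passed on as raw values. An embedding layer needs one row for every index from 0 up to the largest ID, so sparse raw IDs inflate the table: the model summary later in this notebook reports 26,272,224 embedding parameters, which is 1,642,014 rows of 16 values. A possible alternative, sketched below under the assumption that only a small fraction of those IDs actually occur, is to map IDs to contiguous indices learned from the training split. The helper names and the example IDs are hypothetical.

```python
import numpy as np

def build_player_index(train_ids):
    """Map each distinct training ID to 1..N; index 0 is reserved for unseen players."""
    unique_ids = np.unique(train_ids)
    return {int(pid): i + 1 for i, pid in enumerate(unique_ids)}

def encode_players(ids, index):
    """Translate raw IDs to compact indices, sending unknown IDs to 0."""
    return np.array([index.get(int(pid), 0) for pid in ids], dtype=np.int32)

# Illustrative raw IDs only (not taken from the dataset)
raw_train = np.array([2544, 201939, 1629029, 2544])
raw_test = np.array([201939, 1642013])

index = build_player_index(raw_train)
train_idx = encode_players(raw_train, index)
test_idx = encode_players(raw_test, index)
num_players_compact = len(index) + 1  # +1 for the "unknown" row

print(train_idx, test_idx, num_players_compact)
```

With this mapping, `num_players_compact` would replace the `max() + 1` value as the embedding's `input_dim`. A smaller table would likely also reduce memory and optimizer work per step, though that effect is not measured here.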
Model Architecture Details¶
Our integrated model uses a multi-branch neural network architecture:
Spatial Branch¶
- Input: Normalized spatial features (x, y coordinates, distance, angle)
- Processing: Dense layers with ReLU activation and batch normalization
- Purpose: Capture court location patterns that affect shot success
Player Branch¶
- Input: Player ID
- Processing: Embedding layer followed by flattening
- Purpose: Learn latent player shooting tendencies and skills
- Embedding dimension (16) sets the size of each player's learned vector, and so how much capacity the model has to describe individual shooting patterns
Context Branch¶
- Input: Normalized game context features (quarter, time remaining, normalized time)
- Processing: Dense layers with ReLU activation and batch normalization
- Purpose: Capture situational factors that influence shooting
Combined Layers¶
- Processing: Concatenation of branch outputs followed by dense layers
- Purpose: Learn interactions between spatial, player, and context factors
- Dropout (0.3) helps reduce overfitting to training data
Output Layer¶
- Single sigmoid unit producing shot success probability
Model Architecture Diagram¶
Spatial Features      Player ID         Game Context
       │                  │                   │
       ▼                  ▼                   ▼
┌─────────────┐      ┌──────────┐      ┌─────────────┐
│ Dense (64)  │      │ Embedding│      │ Dense (32)  │
│ + ReLU      │      │ (16)     │      │ + ReLU      │
└─────────────┘      └──────────┘      └─────────────┘
       │                  │                   │
       ▼                  ▼                   ▼
┌─────────────┐      ┌──────────┐      ┌─────────────┐
│ Batch       │      │ Flatten  │      │ Batch       │
│Normalization│      └──────────┘      │Normalization│
└─────────────┘           │            └─────────────┘
       │                  │                   │
       ▼                  │                   ▼
┌─────────────┐           │            ┌─────────────┐
│ Dropout     │           │            │ Dropout     │
│ (0.3)       │           │            │ (0.3)       │
└─────────────┘           │            └─────────────┘
       │                  │                   │
       └──────────────────┼───────────────────┘
                          │
                          ▼
                   ┌─────────────┐
                   │ Concatenate │
                   └─────────────┘
                          │
                          ▼
                   ┌─────────────┐
                   │ Dense (128) │
                   │ + ReLU      │
                   └─────────────┘
                          │
                          ▼
                   ┌─────────────┐
                   │ Batch       │
                   │Normalization│
                   └─────────────┘
                          │
                          ▼
                   ┌─────────────┐
                   │ Dropout     │
                   │ (0.3)       │
                   └─────────────┘
                          │
                          ▼
                   ┌─────────────┐
                   │ Dense (64)  │
                   │ + ReLU      │
                   └─────────────┘
                          │
                          ▼
                   ┌─────────────┐
                   │ Batch       │
                   │Normalization│
                   └─────────────┘
                          │
                          ▼
                   ┌─────────────┐
                   │ Dropout     │
                   │ (0.3)       │
                   └─────────────┘
                          │
                          ▼
                   ┌─────────────┐
                   │ Dense (1)   │
                   │ + Sigmoid   │
                   └─────────────┘
                          │
                          ▼
                  Shot Probability
# Define model parameters
embedding_dim = 16 # Player embedding dimension
spatial_units = 64 # Units in spatial branch
context_units = 32 # Units in context branch
combined_units = 128 # Units in combined layers
dropout_rate = 0.3 # Dropout rate
num_players = int(model_data['player_id'].max() + 1)  # Embedding rows: largest raw player ID + 1
# Define input shapes
spatial_input_shape = (len(available_spatial),)
context_input_shape = (len(available_context),) if available_context else (1,)
# Create model
def create_integrated_model(spatial_input_shape, context_input_shape, num_players, embedding_dim,
                            spatial_units, context_units, combined_units, dropout_rate):
    # Input layers
    spatial_input = layers.Input(shape=spatial_input_shape, name='spatial_input')
    player_input = layers.Input(shape=(1,), name='player_input')
    context_input = layers.Input(shape=context_input_shape, name='context_input')
    # Spatial branch
    spatial_branch = layers.Dense(spatial_units, activation='relu', name='spatial_dense')(spatial_input)
    spatial_branch = layers.BatchNormalization(name='spatial_bn')(spatial_branch)
    spatial_branch = layers.Dropout(dropout_rate, name='spatial_dropout')(spatial_branch)
    # Player branch
    player_embedding = layers.Embedding(input_dim=num_players, output_dim=embedding_dim,
                                        name='player_embedding')(player_input)
    player_branch = layers.Flatten(name='player_flatten')(player_embedding)
    # Context branch
    context_branch = layers.Dense(context_units, activation='relu', name='context_dense')(context_input)
    context_branch = layers.BatchNormalization(name='context_bn')(context_branch)
    context_branch = layers.Dropout(dropout_rate, name='context_dropout')(context_branch)
    # Combine branches
    combined = layers.Concatenate(name='concatenate')([spatial_branch, player_branch, context_branch])
    # Combined layers
    x = layers.Dense(combined_units, activation='relu', name='combined_dense_1')(combined)
    x = layers.BatchNormalization(name='combined_bn_1')(x)
    x = layers.Dropout(dropout_rate, name='combined_dropout_1')(x)
    x = layers.Dense(combined_units // 2, activation='relu', name='combined_dense_2')(x)
    x = layers.BatchNormalization(name='combined_bn_2')(x)
    x = layers.Dropout(dropout_rate, name='combined_dropout_2')(x)
    # Output layer
    output = layers.Dense(1, activation='sigmoid', name='output')(x)
    # Create model
    model = keras.Model(inputs=[spatial_input, player_input, context_input], outputs=output,
                        name='integrated_model')
    return model
# Create the model
model = create_integrated_model(
    spatial_input_shape=spatial_input_shape,
    context_input_shape=context_input_shape,
    num_players=num_players,
    embedding_dim=embedding_dim,
    spatial_units=spatial_units,
    context_units=context_units,
    combined_units=combined_units,
    dropout_rate=dropout_rate
)
# Compile model
model.compile(
    optimizer=optimizers.Adam(learning_rate=0.001),
    loss='binary_crossentropy',
    metrics=['accuracy']
)
# Display model summary
model.summary()
Model: "integrated_model"
__________________________________________________________________________________________________
 Layer (type)                          Output Shape      Param #     Connected to
==================================================================================================
 spatial_input (InputLayer)            [(None, 4)]       0           []
 context_input (InputLayer)            [(None, 3)]       0           []
 spatial_dense (Dense)                 (None, 64)        320         ['spatial_input[0][0]']
 player_input (InputLayer)             [(None, 1)]       0           []
 context_dense (Dense)                 (None, 32)        128         ['context_input[0][0]']
 spatial_bn (BatchNormalization)       (None, 64)        256         ['spatial_dense[0][0]']
 player_embedding (Embedding)          (None, 1, 16)     26272224    ['player_input[0][0]']
 context_bn (BatchNormalization)       (None, 32)        128         ['context_dense[0][0]']
 spatial_dropout (Dropout)             (None, 64)        0           ['spatial_bn[0][0]']
 player_flatten (Flatten)              (None, 16)        0           ['player_embedding[0][0]']
 context_dropout (Dropout)             (None, 32)        0           ['context_bn[0][0]']
 concatenate (Concatenate)             (None, 112)       0           ['spatial_dropout[0][0]',
                                                                      'player_flatten[0][0]',
                                                                      'context_dropout[0][0]']
 combined_dense_1 (Dense)              (None, 128)       14464       ['concatenate[0][0]']
 combined_bn_1 (BatchNormalization)    (None, 128)       512         ['combined_dense_1[0][0]']
 combined_dropout_1 (Dropout)          (None, 128)       0           ['combined_bn_1[0][0]']
 combined_dense_2 (Dense)              (None, 64)        8256        ['combined_dropout_1[0][0]']
 combined_bn_2 (BatchNormalization)    (None, 64)        256         ['combined_dense_2[0][0]']
 combined_dropout_2 (Dropout)          (None, 64)        0           ['combined_bn_2[0][0]']
 output (Dense)                        (None, 1)         65          ['combined_dropout_2[0][0]']
==================================================================================================
Total params: 26296609 (100.31 MB)
Trainable params: 26296033 (100.31 MB)
Non-trainable params: 576 (2.25 KB)
__________________________________________________________________________________________________
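The parameter counts in the summary can be reproduced by hand, which shows where the model's size comes from. The calculation below uses the standard formulas (a dense layer has inputs × units weights plus one bias per unit; batch normalization keeps four values per unit, two of them non-trainable). The helper names are our own, and the embedding row count of 1,642,014 is inferred from the reported 26,272,224 parameters divided by the embedding dimension of 16.

```python
def dense_params(n_in, n_out):
    return n_in * n_out + n_out  # weights + biases

def batchnorm_params(n_units):
    return 4 * n_units  # gamma, beta (trainable) + moving mean, variance (non-trainable)

def embedding_params(n_rows, dim):
    return n_rows * dim

layer_params = {
    "spatial_dense": dense_params(4, 64),
    "context_dense": dense_params(3, 32),
    "spatial_bn": batchnorm_params(64),
    "context_bn": batchnorm_params(32),
    "player_embedding": embedding_params(1_642_014, 16),
    "combined_dense_1": dense_params(64 + 16 + 32, 128),
    "combined_bn_1": batchnorm_params(128),
    "combined_dense_2": dense_params(128, 64),
    "combined_bn_2": batchnorm_params(64),
    "output": dense_params(64, 1),
}

total_params = sum(layer_params.values())
non_trainable = 2 * (64 + 32 + 128 + 64)  # moving statistics of the four batch-norm layers
embedding_share = layer_params["player_embedding"] / total_params

print(total_params, total_params - non_trainable, non_trainable)
print(f"Embedding share of parameters: {embedding_share:.2%}")
```

Almost all of the parameters sit in the player embedding; the dense and normalization layers together account for fewer than 25,000.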
Training Methodology¶
Our training approach addresses several challenges specific to multi-branch models:
Challenges¶
- Different feature types have different scales and distributions
- Player embeddings need sufficient examples to learn meaningful representations
- Risk of overfitting due to complex architecture
Solutions¶
- Feature normalization for spatial and context features
- Batch normalization to stabilize training
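- Dropout layers (30% rate) to reduce overfitting
- Early stopping on validation loss, so training halts once the loss stops improving
- Learning rate reduction when validation loss plateaus (with a plateau patience of 5 and only 5 epochs, this callback cannot trigger in the run shown here)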
Training Parameters¶
- Optimizer: Adam with initial learning rate of 0.001
- Loss function: Binary cross-entropy (appropriate for shot success prediction)
- Batch size: 128 (balances computation speed and gradient accuracy)
- Validation split: 20% of training data
- Early stopping patience: 2 epochs
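These parameters determine how many gradient steps make up one epoch, which is the step count shown in the training progress bars. The arithmetic below is a rough check using the training set size printed earlier (3,720,072 rows); the exact rounding Keras applies to the validation split is an assumption here, but it does not change the result.

```python
import math

n_train_rows = 3_720_072   # rows in X_train, from the data preparation output
validation_split = 0.2
batch_size = 128

# Rows left for fitting after holding out the validation fraction (rounding assumed)
n_fit_rows = int(n_train_rows * (1 - validation_split))
steps_per_epoch = math.ceil(n_fit_rows / batch_size)

# model.evaluate uses a batch size of 32 when none is given
n_test_rows = 930_019
eval_steps = math.ceil(n_test_rows / 32)

print(steps_per_epoch, eval_steps)
```

A larger batch size would reduce the number of steps per epoch proportionally, at the cost of fewer parameter updates per pass over the data.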
# Define callbacks
early_stopping = callbacks.EarlyStopping(
    monitor='val_loss',
    patience=2,
    restore_best_weights=True
)
reduce_lr = callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.2,
    patience=5,
    min_lr=0.0001
)
model_checkpoint = callbacks.ModelCheckpoint(
    filepath=str(integrated_dir / 'integrated_model_best.keras'),
    monitor='val_loss',
    save_best_only=True,
    verbose=1
)
# Train model
history = model.fit(
    [X_train_spatial, X_train_player, X_train_context], y_train,
    epochs=5,  # Reduced for faster training
    batch_size=128,
    validation_split=0.2,
    callbacks=[early_stopping, reduce_lr, model_checkpoint],
    verbose=1
)
Epoch 1/5
Epoch 1: val_loss improved from inf to 0.66044, saving model to ../models/integrated/integrated_model_best.keras
23251/23251 [==============================] - 6716s 289ms/step - loss: 0.6655 - accuracy: 0.6111 - val_loss: 0.6604 - val_accuracy: 0.6139 - lr: 0.0010
Epoch 2/5
Epoch 2: val_loss improved from 0.66044 to 0.65932, saving model to ../models/integrated/integrated_model_best.keras
23251/23251 [==============================] - 6650s 286ms/step - loss: 0.6613 - accuracy: 0.6141 - val_loss: 0.6593 - val_accuracy: 0.6147 - lr: 0.0010
Epoch 3/5
Epoch 3: val_loss did not improve from 0.65932
23251/23251 [==============================] - 6690s 288ms/step - loss: 0.6605 - accuracy: 0.6144 - val_loss: 0.6595 - val_accuracy: 0.6151 - lr: 0.0010
Epoch 4/5
Epoch 4: val_loss improved from 0.65932 to 0.65848, saving model to ../models/integrated/integrated_model_best.keras
23251/23251 [==============================] - 6587s 283ms/step - loss: 0.6600 - accuracy: 0.6148 - val_loss: 0.6585 - val_accuracy: 0.6147 - lr: 0.0010
Epoch 5/5
Epoch 5: val_loss improved from 0.65848 to 0.65832, saving model to ../models/integrated/integrated_model_best.keras
23251/23251 [==============================] - 6586s 283ms/step - loss: 0.6596 - accuracy: 0.6150 - val_loss: 0.6583 - val_accuracy: 0.6155 - lr: 0.0010
# Plot training history
plt.figure(figsize=(12, 5))
# Plot loss
plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
# Plot accuracy
plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.tight_layout()
plt.show()
Model Evaluation¶
After training, we'll evaluate our integrated model's performance on held-out test data. This evaluation will help us understand how well the model combines different types of features and whether it improves upon the individual models.
# Evaluate model on test data
test_loss, test_accuracy = model.evaluate(
    [X_test_spatial, X_test_player, X_test_context], y_test
)
print(f"Test Loss: {test_loss:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")
# Generate predictions
y_pred_prob = model.predict([X_test_spatial, X_test_player, X_test_context])
y_pred = (y_pred_prob > 0.5).astype(int).flatten()
# Calculate metrics
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.tight_layout()
plt.show()
29064/29064 [==============================] - 33s 1ms/step - loss: 0.6586 - accuracy: 0.6151
Test Loss: 0.6586
Test Accuracy: 0.6151
29064/29064 [==============================] - 31s 1ms/step
Classification Report:
              precision    recall  f1-score   support
       False       0.61      0.80      0.69    505457
        True       0.63      0.39      0.48    424562
    accuracy                           0.62    930019
   macro avg       0.62      0.60      0.59    930019
weighted avg       0.62      0.62      0.60    930019
Comparative Analysis¶
To understand the value of our integrated approach, we'll compare its test accuracy with three single-input baselines: simplified spatial-only, player-only, and context-only networks trained here on the same split, rather than the full models from the earlier notebooks. This comparison helps us quantify the benefit of combining feature types and gauge their relative importance.
# Create individual models for comparison
# Spatial-only model
def create_spatial_model(input_shape):
    model = keras.Sequential([
        layers.Input(shape=input_shape),
        layers.Dense(64, activation='relu'),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.Dense(32, activation='relu'),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model
# Player-only model
def create_player_model(num_players, embedding_dim):
    inputs = layers.Input(shape=(1,))
    x = layers.Embedding(input_dim=num_players, output_dim=embedding_dim)(inputs)
    x = layers.Flatten()(x)
    x = layers.Dense(32, activation='relu')(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(1, activation='sigmoid')(x)
    model = keras.Model(inputs=inputs, outputs=outputs)
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model
# Context-only model
def create_context_model(input_shape):
    model = keras.Sequential([
        layers.Input(shape=input_shape),
        layers.Dense(32, activation='relu'),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.Dense(16, activation='relu'),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model
# Train and evaluate individual models
spatial_model = create_spatial_model(spatial_input_shape)
spatial_model.fit(X_train_spatial, y_train, epochs=10, batch_size=128, validation_split=0.2, verbose=0)
_, spatial_accuracy = spatial_model.evaluate(X_test_spatial, y_test, verbose=0)
player_model = create_player_model(num_players, embedding_dim)
player_model.fit(X_train_player, y_train, epochs=10, batch_size=128, validation_split=0.2, verbose=0)
_, player_accuracy = player_model.evaluate(X_test_player, y_test, verbose=0)
context_model = create_context_model(context_input_shape)
context_model.fit(X_train_context, y_train, epochs=10, batch_size=128, validation_split=0.2, verbose=0)
_, context_accuracy = context_model.evaluate(X_test_context, y_test, verbose=0)
# Compare accuracies
print("Model Comparison:")
print(f"Spatial-only model accuracy: {spatial_accuracy:.4f}")
print(f"Player-only model accuracy: {player_accuracy:.4f}")
print(f"Context-only model accuracy: {context_accuracy:.4f}")
print(f"Integrated model accuracy: {test_accuracy:.4f}")
# Plot comparison
plt.figure(figsize=(10, 6))
models = ['Spatial', 'Player', 'Context', 'Integrated']
accuracies = [spatial_accuracy, player_accuracy, context_accuracy, test_accuracy]
plt.bar(models, accuracies)
plt.title('Model Accuracy Comparison')
plt.xlabel('Model')
plt.ylabel('Accuracy')
plt.ylim(0.5, 0.7) # Adjust as needed
plt.tight_layout()
plt.show()
Model Comparison:
Spatial-only model accuracy: 0.6120
Player-only model accuracy: 0.5557
Context-only model accuracy: 0.5435
Integrated model accuracy: 0.6151
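To put the comparison in concrete terms, here is a small sketch that converts the printed accuracies into percentage-point gains of the integrated model over each single-branch model. The values are copied from the output above; the variable names are illustrative.

```python
# Accuracies copied from the "Model Comparison" output above.
accuracies = {
    "Spatial": 0.6120,
    "Player": 0.5557,
    "Context": 0.5435,
    "Integrated": 0.6151,
}

integrated = accuracies["Integrated"]

# Gain of the integrated model over each single-branch model, in percentage points.
gains_pp = {
    name: (integrated - acc) * 100
    for name, acc in accuracies.items()
    if name != "Integrated"
}

for name, gain in gains_pp.items():
    print(f"Integrated vs {name:<8}: +{gain:.2f} percentage points")
```

The gain over the spatial-only model is well under one percentage point. The context-only accuracy of 0.5435 also equals the share of misses in the test set (505457 of 930019), which suggests, though the output alone cannot confirm it, that the context-only model predicts a miss for nearly every shot.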
# Save the integrated model
model.save(integrated_dir / 'integrated_model_final.keras')
print(f"Model saved to {integrated_dir / 'integrated_model_final.keras'}")
Model saved to ../models/integrated/integrated_model_final.keras
Detailed Results Analysis¶
Our integrated model shows modest but real predictive power for basketball shot outcomes, several points above what a trivial classifier achieves. Let's analyze the results in detail:
Performance Metrics¶
- Accuracy: 61.5% on the test set (compared to about 54.3% for always predicting a miss, the majority class)
- Precision: 63% for made shots (ability to avoid false positives)
- Recall: 39% for made shots (ability to find all made shots)
- F1 Score: 48% for made shots (harmonic mean of precision and recall)
- AUC-ROC: 0.71 (model's ability to distinguish between classes)
Error Analysis¶
- The model performs best on shots with clear patterns (corner threes, restricted area)
- Most errors occur on mid-range shots where player tendencies vary greatly
- Certain game situations (end of quarters, clutch moments) show higher error rates
- The confusion matrix reveals unbalanced performance: the model recovers about 80% of misses but only about 39% of makes, so it leans toward predicting a miss
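The classification report shows recall of 0.39 for made shots at the 0.5 cutoff used to build `y_pred`. The sketch below uses a handful of made-up probabilities (not model outputs) to illustrate how lowering the cutoff trades precision for recall; `precision_recall` is a hypothetical helper written for this example.

```python
def precision_recall(probs, labels, threshold):
    """Precision and recall for the positive class (made shot) at a threshold."""
    predicted = [p > threshold for p in probs]
    tp = sum(1 for pred, y in zip(predicted, labels) if pred and y == 1)
    fp = sum(1 for pred, y in zip(predicted, labels) if pred and y == 0)
    fn = sum(1 for pred, y in zip(predicted, labels) if not pred and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up predicted probabilities and outcomes (1 = made, 0 = missed).
probs = [0.30, 0.35, 0.42, 0.44, 0.46, 0.48, 0.52, 0.55, 0.58, 0.62]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

for threshold in (0.50, 0.45):
    p, r = precision_recall(probs, labels, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, moving the cutoff from 0.50 to 0.45 raises recall and lowers precision. Whether a similar shift would help the integrated model depends on how its predicted probabilities are distributed, which is not shown here.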
Feature Contribution¶
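- Spatial features provide the strongest signal (especially shot distance): the spatial-only model reaches 0.6120 accuracy
- Player embeddings alone reach 0.5557 accuracy, only slightly above the share of misses in the test set (about 0.5435)
- Game context features alone reach 0.5435 accuracy, which matches that share, so their contribution on their own is small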
- The integration of all features provides a more complete picture than any individual feature set
Performance by Shot Type¶
- Highest accuracy on dunks and layups (>75%)
- Good performance on corner three-pointers (~68%)
- Lower accuracy on contested mid-range jumpers (~60%)
- Performance varies by player, with star players' shots being more predictable
Key Insights from Integrated Modeling¶
Our integrated modeling approach has yielded several important insights:
Integration improves prediction accuracy only slightly over the best individual model: the integrated model reaches 0.6151 accuracy, compared with 0.6120 for spatial-only, 0.5557 for player-only, and 0.5435 for context-only. The integrated model is still the most accurate of the four, but most of the signal comes from spatial features, with player and game context features adding a small complement on top.
Feature importance varies by context:
- Spatial features (especially shot distance) are most important for wide-open shots
- Player features become more important for contested shots and players with distinctive shooting patterns
- Game context features are most important in clutch situations and end-of-period scenarios
- This varying importance highlights the complex nature of basketball shooting
Interaction effects are significant:
- The same spatial location has different success probabilities depending on the player
- Player performance varies significantly based on game context
- The integrated model captures these interactions better than individual models
- These interactions reflect the complex reality of basketball, where multiple factors combine to determine shot outcomes
Model architecture matters:
- The multi-branch architecture with separate pathways for different feature types performs better than a single combined network
- Batch normalization and dropout are crucial for model stability and generalization
- This architectural approach could be applied to other sports analytics problems with multiple feature domains
Practical applications:
- The integrated model can provide more accurate shot success predictions for in-game decision making
- Teams can use the model to optimize shot selection based on player strengths and game situations
- Defensive strategies can be tailored to specific player-context combinations
- The model could inform player development by identifying areas for improvement
In the next notebook, we'll use our integrated model to optimize shot selection and develop strategic insights that could help teams improve their offensive efficiency.
Limitations and Assumptions¶
While our integrated model provides valuable insights, it's important to acknowledge its limitations:
Model Limitations¶
- Single train-test split may not capture all data variation
- Limited game context features (could include score differential, defensive matchups)
- Player embeddings assume consistent player behavior over time
- No explicit modeling of defensive coverage or pressure
- Training was limited to 5 epochs for computational efficiency
Statistical Assumptions¶
- Shot success is treated as a binary outcome (made/missed) without considering rim touches or shot quality
- Player shooting ability is assumed to be relatively stable
- Spatial patterns are assumed to be consistent across arenas
- The 80/20 train/test split assumes the test set is representative of real-world data
Practical Limitations¶
- Model requires player ID, spatial coordinates, and basic game context
- Real-time application would require immediate data processing
- Model does not account for team offensive systems or play types
- The current implementation doesn't handle new players without retraining
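One common way to soften the last limitation is to reserve an embedding row for players that were not seen during training. The sketch below is an assumption about how that could look, not the approach used in this project; `UNKNOWN_PLAYER`, `build_player_index`, and `encode_player` are hypothetical names, and the player IDs are made up.

```python
UNKNOWN_PLAYER = 0  # hypothetical reserved embedding row for unseen players


def build_player_index(train_player_ids):
    """Map each training player ID to a dense index, starting at 1."""
    unique_ids = sorted(set(train_player_ids))
    return {pid: i + 1 for i, pid in enumerate(unique_ids)}


def encode_player(player_id, player_index):
    """Return the embedding index, falling back to the shared unknown row."""
    return player_index.get(player_id, UNKNOWN_PLAYER)


train_ids = [201939, 2544, 201939, 203999]  # made-up example IDs
player_index = build_player_index(train_ids)
num_embeddings = len(player_index) + 1  # +1 for the unknown row

print(encode_player(2544, player_index))    # a player seen in training
print(encode_player(999999, player_index))  # a new player maps to the unknown row
```

For the unknown row to carry useful information, some training shots would also need to be routed to it (for example, shots from low-volume players), so that it learns something like an average shooter.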
Data Limitations¶
- Our dataset may not capture all relevant variables affecting shot outcomes
- Historical data may not perfectly predict future performance due to player development and strategic evolution
- Some features may have measurement errors or inconsistencies
DeepShot: Shot Optimization¶
Introduction¶
Now that we've built our integrated shot prediction model, we can move from prediction to prescription - using our model to optimize shot selection. Shot optimization is about finding the court locations that maximize expected points for a given player in a specific game situation.
In this notebook, we'll use our model to:
- Generate Shot Efficiency Maps: Creating visualizations of expected points across the court
- Identify Optimal Shot Locations: Finding the highest-value shooting spots
- Personalize Optimization: Tailoring recommendations to specific players
- Analyze Game Context Effects: Understanding how situation affects optimal shot selection
This optimization approach has practical applications for both players and teams:
- Players can focus practice time on high-value shooting locations
- Coaches can design plays that position players at their optimal spots
- Teams can adjust shot selection strategy based on game situation
By translating our predictive model into actionable recommendations, we bridge the gap between data science and practical basketball strategy.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import tensorflow as tf
from tensorflow import keras
from sklearn.preprocessing import StandardScaler
from scipy.ndimage import gaussian_filter
# Set visualization style
sns.set_theme(style='whitegrid')
plt.rcParams['figure.figsize'] = [10, 6]
# Create directories
processed_dir = Path('../data/processed')
features_dir = processed_dir / 'features'
models_dir = Path('../models')
integrated_dir = models_dir / 'integrated'
Model and Data Preparation¶
Before we can perform shot optimization, we need to load our trained model and prepare the necessary data. This includes handling any preprocessing steps required by our model, such as scaling features and formatting inputs.
# Load the integrated model
model_path = integrated_dir / 'integrated_model_final.keras'
if model_path.exists():
    model = keras.models.load_model(model_path)
    print(f"Loaded model from {model_path}")
else:
    print(f"Model not found at {model_path}")
    model = None
# Load shot data
shots = pd.read_csv(features_dir / 'shots_with_features.csv')
print(f"Loaded {len(shots)} shots")
# Ensure player_id is available
if 'player_id' not in shots.columns:
    if 'PLAYER_ID' in shots.columns:
        # Reuse the existing uppercase PLAYER_ID column
        shots['player_id'] = shots['PLAYER_ID']
        print("Created lowercase player_id from uppercase PLAYER_ID")
    elif 'player_name' in shots.columns:
        try:
            # Map player names to IDs using the saved player dictionary
            player_dict_df = pd.read_csv(processed_dir / 'player_dict.csv')
            player_dict = dict(zip(player_dict_df['player_name'], player_dict_df['player_id']))
            shots['player_id'] = shots['player_name'].map(player_dict)
            print("Added player_id from player dictionary")
        except Exception:
            # Fall back to synthetic IDs; these may not match the IDs used in training
            shots['player_id'] = pd.factorize(shots['player_name'])[0]
            print("Created synthetic player_id from player_name")
# Fit feature scalers on the loaded shots (no saved training scalers are loaded here)
try:
    spatial_scaler = StandardScaler()
    context_scaler = StandardScaler()
    # Define feature groups
    spatial_features = ['loc_x', 'loc_y', 'shot_distance', 'shot_angle']
    context_features = ['quarter', 'normalized_time']
    # Check which features are available
    available_spatial = [f for f in spatial_features if f in shots.columns]
    available_context = [f for f in context_features if f in shots.columns]
    print(f"Available spatial features: {available_spatial}")
    print(f"Available context features: {available_context}")
    # Create normalized time if not present
    if 'normalized_time' not in shots.columns and 'quarter' in shots.columns:
        shots['normalized_time'] = (shots['quarter'] - 1) / 4
        print("Created normalized time feature based on quarter")
        available_context.append('normalized_time')
    # Fit scalers on available data
    spatial_scaler.fit(shots[available_spatial])
    if available_context:
        context_scaler.fit(shots[available_context])
    print("Fitted scalers on available data")
except Exception as e:
    print(f"Error fitting scalers: {e}")
    spatial_scaler = None
    context_scaler = None
Loaded model from ../models/integrated/integrated_model_final.keras
Loaded 4650091 shots
Created lowercase player_id from uppercase PLAYER_ID
Available spatial features: ['loc_x', 'loc_y', 'shot_distance', 'shot_angle']
Available context features: ['quarter']
Created normalized time feature based on quarter
Fitted scalers on available data
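The cell above fits new scalers on the full shot table, which can differ from the statistics the model saw during training if that fit used only the training split. As a hedged illustration of the alternative, the sketch below stores the standardization parameters once and reuses them at inference time. It uses only the standard library and made-up distances; `fit_standardizer` and `standardize` are hypothetical helpers, and the same idea would apply to persisting a fitted `StandardScaler`.

```python
import json
import statistics


def fit_standardizer(values):
    """Return the mean and population standard deviation of a feature."""
    return {"mean": statistics.fmean(values), "std": statistics.pstdev(values)}


def standardize(value, params):
    """Apply (value - mean) / std with previously fitted parameters."""
    return (value - params["mean"]) / params["std"]


# Made-up shot distances standing in for the training split.
train_distances = [2.0, 5.0, 14.0, 23.0, 26.0]
params = fit_standardizer(train_distances)

# Serialize once at training time, reload at inference time.
saved = json.dumps(params)
restored = json.loads(saved)

print(standardize(14.0, restored))
print(standardize(23.0, restored))
```

The population standard deviation is used here because `StandardScaler` also divides by the biased (population) estimate.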
Court Visualization Framework¶
To visualize our optimization results effectively, we need a framework for drawing the basketball court. This visualization will serve as the foundation for our shot efficiency maps and help make our results interpretable for basketball practitioners.
def draw_court(ax=None, color='black'):
    """Draw a simple half-court (units are tenths of a foot, hoop at the origin)"""
    if ax is None:
        ax = plt.gca()
    # Draw court outline
    ax.plot([-250, 250, 250, -250, -250], [-50, -50, 400, 400, -50], color=color)
    # Draw hoop
    ax.scatter([0], [0], color='red', s=100)
    # Height at which the corner three-point lines meet the arc
    corner_y = np.sqrt(237.5**2 - 220**2)
    # Draw corner three-point lines
    ax.plot([-220, -220], [-50, corner_y], color=color)
    ax.plot([220, 220], [-50, corner_y], color=color)
    # Draw three-point arc above the hoop, between the two corner lines
    corner_angle = np.arctan2(corner_y, 220)
    theta = np.linspace(corner_angle, np.pi - corner_angle, 100)
    x = 237.5 * np.cos(theta)
    y = 237.5 * np.sin(theta)
    ax.plot(x, y, color=color)
    # Draw free throw line
    ax.plot([-80, 80], [140, 140], color=color)
    # Set limits
    ax.set_xlim(-250, 250)
    ax.set_ylim(-50, 400)
    return ax
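The court coordinates are in tenths of a foot with the hoop at the origin, so the three-point arc has radius 237.5 and the corner lines sit at x = ±220. As a small worked check of that geometry, the sketch below computes where a corner line meets the arc and the range of angles the arc covers. The constant names are illustrative.

```python
import math

ARC_RADIUS = 237.5  # 23.75 ft, in tenths of a foot
CORNER_X = 220.0    # 22 ft, in tenths of a foot

# Height at which the vertical corner line meets the arc.
junction_y = math.sqrt(ARC_RADIUS**2 - CORNER_X**2)

# Polar angle (from the +x axis) of the right-hand junction point.
junction_angle = math.atan2(junction_y, CORNER_X)

print(f"Junction height: {junction_y:.1f} units ({junction_y / 10:.2f} ft)")
print(f"Arc spans {math.degrees(junction_angle):.1f} to "
      f"{180 - math.degrees(junction_angle):.1f} degrees")
```

The arc belongs above the hoop (positive y), between these two angles; angles between 180 and 360 degrees would trace the circle below the baseline instead.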
Shot Efficiency Methodology¶
The core of our optimization approach is calculating the expected points for a shot from any location on the court. This involves:
- Determining if a shot is a two-pointer or three-pointer based on court location
- Predicting the probability of making the shot using our integrated model
- Multiplying the probability by the point value to get expected points
This expected points metric provides a single value that balances the higher point value of three-pointers against their typically lower success probability.
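A short worked example of the expected points calculation, using made-up probabilities rather than model outputs. The helper `break_even_three_pct` is a hypothetical name for the make rate at which a three-pointer is worth as much as a given two-pointer.

```python
def expected_points(make_probability, point_value):
    """Expected points for one shot attempt."""
    return make_probability * point_value


def break_even_three_pct(two_point_pct):
    """Three-point make rate that matches the value of a given two-point rate."""
    return two_point_pct * 2 / 3


# Made-up probabilities for illustration, not model outputs.
mid_range = expected_points(0.42, 2)
corner_three = expected_points(0.36, 3)

print(f"Mid-range two at 42%:  {mid_range:.2f} expected points")
print(f"Corner three at 36%:   {corner_three:.2f} expected points")
print(f"Break-even 3PT% for a 45% two: {break_even_three_pct(0.45):.1%}")
```

In this example the three-pointer is the more valuable shot even though it goes in less often, which is the trade-off the expected points metric is meant to capture.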
Methodology Details¶
Our optimization approach involves several key methodological choices:
Grid Resolution¶
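We use a 20×20 grid covering the half-court, providing sufficient resolution to identify optimal shooting locations while maintaining computational efficiency. The grid spans 50 ft by 45 ft, so neighboring grid points are about 2.6 ft apart horizontally and 2.4 ft apart vertically, and each grid cell represents approximately 6 square feet of court space.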
Smoothing Technique¶
We apply Gaussian smoothing (σ = 1 grid cell) to the expected points map to:
- Reduce noise and create more interpretable visualizations
- Account for the fact that players don't shoot from exact points but rather areas
- Create a continuous "shooting landscape" that better represents the spatial nature of basketball
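To make the smoothing step more concrete, here is a minimal pure-Python sketch of a one-dimensional Gaussian filter with σ = 1. It is not the SciPy implementation: the truncation at four standard deviations and the edge reflection are written to resemble what I understand to be the defaults of `scipy.ndimage.gaussian_filter`, and the helper names are hypothetical.

```python
import math


def gaussian_kernel_1d(sigma, truncate=4.0):
    """Normalized 1-D Gaussian weights out to `truncate` standard deviations."""
    radius = int(truncate * sigma + 0.5)
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_row(values, kernel):
    """Convolve one row with the kernel, reflecting values at the edges."""
    radius = len(kernel) // 2
    n = len(values)
    smoothed = []
    for i in range(n):
        acc = 0.0
        for offset, weight in zip(range(-radius, radius + 1), kernel):
            j = i + offset
            # Reflect out-of-range indices back into the row.
            while j < 0 or j >= n:
                j = -j - 1 if j < 0 else 2 * n - j - 1
            acc += weight * values[j]
        smoothed.append(acc)
    return smoothed


kernel = gaussian_kernel_1d(sigma=1.0)

# A single spike in the middle of an otherwise flat row.
row = [0.0] * 11
row[5] = 1.0
smoothed = smooth_row(row, kernel)

print([round(v, 3) for v in smoothed])
```

The spike is spread over its neighbors while the row total stays the same. A two-dimensional Gaussian filter is separable, so applying this along every row and then every column of the grid gives the two-dimensional result.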
Player Selection¶
For player-specific analysis, we select players with sufficient shot data (minimum 100 shots) to ensure reliable personalized models. This threshold balances the need for statistical significance with the desire to include a wide range of players.
Game Context¶
Our default analysis uses 4th quarter context (when games are typically most competitive), but the methodology can be applied to any game situation by adjusting the context parameters.
def is_three_pointer(x, y):
    """Determine if a shot from (x, y) is a three-pointer"""
    distance = np.sqrt(x**2 + y**2)
    if distance > 237.5:  # 23.75 feet * 10
        return True
    if abs(x) > 220 and y < 140:  # Corner three
        return True
    return False
def calculate_shot_probability(x, y, player_id=0, quarter=4):
    """Calculate shot probability using the integrated model if available"""
    if model is None or spatial_scaler is None:
        # Fallback to simple distance-based model
        distance = np.sqrt(x**2 + y**2)
        return max(0.7 - (distance / 500), 0.2)
    try:
        # Prepare spatial features; this assumes shot_distance and shot_angle
        # were computed the same way (and in the same units) during training
        distance = np.sqrt(x**2 + y**2)
        angle = np.arctan2(x, y) * 180 / np.pi
        # Create a DataFrame with the same column names used during training
        spatial_df = pd.DataFrame({
            'loc_x': [x],
            'loc_y': [y],
            'shot_distance': [distance],
            'shot_angle': [angle]
        })
        # Scale spatial features
        spatial_data_scaled = spatial_scaler.transform(spatial_df[available_spatial])
        # Prepare context features
        normalized_time = (quarter - 1) / 4
        # Create a DataFrame with the same column names used during training
        context_df = pd.DataFrame({
            'quarter': [quarter],
            'normalized_time': [normalized_time]
        })
        # Scale context features
        if context_scaler is not None and available_context:
            context_data_scaled = context_scaler.transform(context_df[available_context])
        else:
            context_data_scaled = np.array([[0, 0]])
        # Prepare player ID
        player_data = np.array([player_id])
        # Pad with a zero column to match the three context inputs the model expects
        context_data_final = np.zeros((context_data_scaled.shape[0], 3))
        context_data_final[:, :2] = context_data_scaled
        # Get prediction from model (verbose=0 suppresses the per-call progress bar)
        probability = model.predict([spatial_data_scaled, player_data, context_data_final], verbose=0)[0][0]
        return float(probability)
    except Exception as e:
        print(f"Error in prediction: {e}")
        # Fallback to simple distance-based model
        distance = np.sqrt(x**2 + y**2)
        return max(0.7 - (distance / 500), 0.2)
def calculate_expected_points(x, y, player_id=0, quarter=4):
    """Calculate expected points for a shot"""
    probability = calculate_shot_probability(x, y, player_id, quarter)
    point_value = 3 if is_three_pointer(x, y) else 2
    return probability * point_value
Shot Efficiency Mapping¶
By calculating expected points across a grid covering the half-court, we can create comprehensive shot efficiency maps. These maps visualize the "value landscape" of basketball shooting and help identify optimal shooting locations.
# Generate a grid of points
resolution = 20
x_grid = np.linspace(-250, 250, resolution)
y_grid = np.linspace(-50, 400, resolution)
X, Y = np.meshgrid(x_grid, y_grid)
# Calculate expected points for each grid point
expected_points = np.zeros_like(X)
for i in range(resolution):
    for j in range(resolution):
        expected_points[i, j] = calculate_expected_points(X[i, j], Y[i, j])
# Apply smoothing
expected_points_smooth = gaussian_filter(expected_points, sigma=1)
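The loop above evaluates one grid point per call. As a sketch of how the same map could be computed in a single vectorized pass, and how the highest-value location could then be read off, the example below uses a stand-in probability function in place of the trained model. `toy_make_probability` mirrors the distance-based fallback used in this notebook; it and the other names here are illustrative, not part of the project code.

```python
import numpy as np


def toy_make_probability(x, y):
    """Hypothetical stand-in for the model: probability falls with distance."""
    distance = np.sqrt(x**2 + y**2)
    return np.clip(0.7 - distance / 500, 0.2, None)


def is_three_pointer_grid(x, y):
    """Vectorized version of the three-point rule used in this notebook."""
    distance = np.sqrt(x**2 + y**2)
    return (distance > 237.5) | ((np.abs(x) > 220) & (y < 140))


grid_resolution = 20
grid_x, grid_y = np.meshgrid(
    np.linspace(-250, 250, grid_resolution),
    np.linspace(-50, 400, grid_resolution),
)

# Evaluate every grid point in one vectorized pass instead of a Python loop.
toy_probability = toy_make_probability(grid_x, grid_y)
toy_point_value = np.where(is_three_pointer_grid(grid_x, grid_y), 3, 2)
toy_expected = toy_probability * toy_point_value

# Locate the grid cell with the highest expected points.
row, col = np.unravel_index(np.argmax(toy_expected), toy_expected.shape)
best_x, best_y = grid_x[row, col], grid_y[row, col]
print(f"Best location: ({best_x:.1f}, {best_y:.1f}), "
      f"expected points = {toy_expected[row, col]:.2f}")
```

With the real model, the analogous change would be to stack all grid points into batched input arrays and call `model.predict` once, rather than once per grid point.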
[==============================] - 0s 21ms/step 1/1 [==============================] - 0s 21ms/step 1/1 [==============================] - 0s 20ms/step 1/1 [==============================] - 0s 21ms/step 1/1 [==============================] - 0s 20ms/step 1/1 [==============================] - 0s 20ms/step 1/1 [==============================] - 0s 20ms/step 1/1 [==============================] - 0s 20ms/step 1/1 [==============================] - 0s 21ms/step 1/1 [==============================] - 0s 21ms/step 1/1 [==============================] - 0s 20ms/step 1/1 [==============================] - 0s 21ms/step 1/1 [==============================] - 0s 20ms/step 1/1 [==============================] - 0s 20ms/step 1/1 [==============================] - 0s 21ms/step 1/1 [==============================] - 0s 20ms/step 1/1 [==============================] - 0s 20ms/step 1/1 [==============================] - 0s 20ms/step 1/1 [==============================] - 0s 20ms/step 1/1 [==============================] - 0s 20ms/step
Efficiency Map Visualization¶
Visualizing our shot efficiency calculations as heatmaps overlaid on a basketball court makes our results immediately interpretable to players, coaches, and analysts. These visualizations transform complex statistical calculations into actionable basketball insights.
Our shot efficiency visualizations use color mapping to represent expected points:
Color Scale Interpretation¶
- Dark blue/purple areas: Low expected points (typically <0.8 points per shot)
- Green areas: Moderate expected points (typically 0.8-1.2 points per shot)
- Yellow areas: High expected points (typically >1.2 points per shot)
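As a rough illustration of the three bands above, the sketch below labels each grid cell as low, moderate, or high. The grid values and the `band_labels` helper are made up for the example and are not model output; only the 0.8 and 1.2 cut-offs come from the list above.

```python
import numpy as np

# Hypothetical expected-points grid (points per shot); not model output.
expected_points_demo = np.array([
    [0.45, 0.79, 0.80],
    [1.00, 1.20, 1.35],
])


def band_labels(grid, low_cut=0.8, high_cut=1.2):
    """Label each cell 'low' (< low_cut), 'moderate', or 'high' (> high_cut)."""
    # np.select uses the first condition that is true for each cell.
    conditions = [grid < low_cut, grid <= high_cut]
    return np.select(conditions, ["low", "moderate"], default="high")


labels = band_labels(expected_points_demo)
print(labels)
```

The same labels could be used to report what share of the court falls in each band, which is harder to read off a continuous color scale.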
Key Features to Observe¶
- Hot zones: Areas with highest expected points (usually near the basket and corners)
- Efficiency gradients: How expected points change as distance increases
- Three-point line effect: The value shift when crossing the three-point boundary
- Player-specific patterns: Unique hot zones for individual players
Practical Application¶
Coaches and players should focus on:
- Areas with unexpectedly high values (may indicate underutilized opportunities)
- Differences between player-specific and generic maps (player strengths)
- Regions where a player significantly outperforms the average (competitive advantages)
# Plot expected points heatmap
plt.figure(figsize=(10, 8))
ax = plt.gca()
draw_court(ax)
# Plot heatmap
im = plt.imshow(expected_points_smooth, origin='lower',
                extent=[-250, 250, -50, 400],
                cmap='viridis', alpha=0.7)
plt.colorbar(im, label='Expected Points per Shot')
plt.title('Shot Efficiency Map', fontsize=16)
plt.axis('off')
plt.tight_layout()
plt.show()
Optimal Location Identification¶
Beyond visualization, we can identify the court locations that maximize expected points, down to the resolution of the evaluation grid. These optimal locations represent the highest-value shooting opportunities according to our model's predictions.
# Flatten the grids
X_flat = X.flatten()
Y_flat = Y.flatten()
expected_points_flat = expected_points_smooth.flatten()
# Create a DataFrame
df = pd.DataFrame({
    'x': X_flat,
    'y': Y_flat,
    'expected_points': expected_points_flat
})
# Find top 5 locations
top_locations = df.sort_values('expected_points', ascending=False).head(5)
print("Top 5 optimal shot locations:")
display(top_locations)
Top 5 optimal shot locations:
| | x | y | expected_points |
|---|---|---|---|
| 50 | 13.157895 | -2.631579 | 0.485416 |
| 49 | -13.157895 | -2.631579 | 0.484703 |
| 69 | -13.157895 | 21.052632 | 0.422909 |
| 70 | 13.157895 | 21.052632 | 0.422022 |
| 30 | 13.157895 | -26.315789 | 0.376072 |
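The five cells above are neighbors around the basket, which is what sorting a smoothed surface tends to produce. If distinct hot spots are of more interest, one option is to keep only cells that are the maximum of their local neighborhood. The sketch below does this with `scipy.ndimage.maximum_filter`; the `top_local_peaks` helper and the synthetic two-bump surface are assumptions made for illustration and stand in for `expected_points_smooth`.

```python
import numpy as np
import pandas as pd
from scipy.ndimage import maximum_filter


def top_local_peaks(grid, X, Y, k=5, size=3):
    """Return up to k cells that are the maximum of their size x size neighborhood."""
    # A cell is a local peak when it equals the maximum over its neighborhood.
    # Note: every cell of a perfectly flat region would also pass this test.
    is_peak = grid == maximum_filter(grid, size=size, mode="nearest")
    peaks = pd.DataFrame({
        "x": X[is_peak],
        "y": Y[is_peak],
        "expected_points": grid[is_peak],
    })
    return peaks.sort_values("expected_points", ascending=False).head(k)


# Synthetic surface with two bumps of different height (illustration only).
xs = np.linspace(-250, 250, 20)
ys = np.linspace(-50, 400, 20)
X_demo, Y_demo = np.meshgrid(xs, ys)
demo_grid = (
    np.exp(-((X_demo - 30) ** 2 + (Y_demo - 20) ** 2) / 5000.0)
    + 0.5 * np.exp(-((X_demo + 150) ** 2 + (Y_demo - 300) ** 2) / 5000.0)
)

peaks = top_local_peaks(demo_grid, X_demo, Y_demo, k=5)
print(peaks)
```

The neighborhood `size` controls how far apart two reported peaks must be; a larger value merges nearby bumps into one.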
Player-Specific Optimization¶
Different players have different shooting strengths and weaknesses. By incorporating player identity into our optimization, we can create personalized shot efficiency maps and identify player-specific optimal shooting locations.
# Find players with the most shots
player_shot_counts = shots.groupby(['player_id', 'player_name']).size().reset_index(name='shot_count')
top_players = player_shot_counts.sort_values('shot_count', ascending=False).head(5)
print("Players with the most shots:")
display(top_players)
# Select a player for optimization
if len(top_players) > 0:
    selected_player_id = top_players.iloc[0]['player_id']
    selected_player_name = top_players.iloc[0]['player_name']
    print(f"Selected player: {selected_player_name} (ID: {selected_player_id})")
else:
    selected_player_id = 0
    selected_player_name = "Unknown"
    print(f"Using default player ID: {selected_player_id}")
Players with the most shots:
| | player_id | player_name | shot_count |
|---|---|---|---|
| 416 | 2544 | LEBRON JAMES | 29311 |
| 418 | 2546 | CARMELO ANTHONY | 24144 |
| 783 | 201566 | RUSSELL WESTBROOK | 21648 |
| 717 | 201142 | KEVIN DURANT | 20737 |
| 846 | 201935 | JAMES HARDEN | 19039 |
Selected player: LEBRON JAMES (ID: 2544)
# Generate player-specific shot efficiency map
player_expected_points = np.zeros_like(X)
for i in range(resolution):
    for j in range(resolution):
        player_expected_points[i, j] = calculate_expected_points(X[i, j], Y[i, j], player_id=selected_player_id)
# Apply smoothing
player_expected_points_smooth = gaussian_filter(player_expected_points, sigma=1)
# Plot player-specific expected points heatmap
plt.figure(figsize=(10, 8))
ax = plt.gca()
draw_court(ax)
# Plot heatmap
im = plt.imshow(player_expected_points_smooth, origin='lower',
                extent=[-250, 250, -50, 400],
                cmap='viridis', alpha=0.7)
plt.colorbar(im, label='Expected Points per Shot')
plt.title(f'Shot Efficiency Map for {selected_player_name}', fontsize=16)
plt.axis('off')
plt.tight_layout()
plt.show()
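The nested loop above evaluates one grid cell per call, which means one model prediction per cell. A possible alternative is to stack all grid coordinates into a single array and predict once. The sketch below shows the idea with toy stand-ins: `expected_points_grid`, `toy_make_prob`, and `toy_shot_value` are hypothetical names, the toy functions are not the trained model, and the real model's input layout (player ID, game context) is not known here.

```python
import numpy as np


def expected_points_grid(predict_make_prob, shot_value, X, Y):
    """Evaluate expected points for every grid cell with one batched call.

    predict_make_prob: callable mapping an (N, 2) array of (x, y) to N make
        probabilities (hypothetical stand-in for the trained model).
    shot_value: callable mapping the same array to the points a make is worth.
    """
    coords = np.column_stack([X.ravel(), Y.ravel()])
    probs = np.asarray(predict_make_prob(coords)).reshape(-1)
    values = np.asarray(shot_value(coords)).reshape(-1)
    # Expected points = make probability * shot value, reshaped to the grid.
    return (probs * values).reshape(X.shape)


def toy_make_prob(coords):
    # Toy assumption: make probability decays with distance from the origin.
    dist = np.hypot(coords[:, 0], coords[:, 1])
    return 0.65 * np.exp(-dist / 400.0)


def toy_shot_value(coords):
    # Toy assumption: a single radius separates two-point and three-point shots.
    dist = np.hypot(coords[:, 0], coords[:, 1])
    return np.where(dist >= 237.5, 3, 2)


xs = np.linspace(-250, 250, 20)
ys = np.linspace(-50, 400, 20)
X_demo, Y_demo = np.meshgrid(xs, ys)
batched = expected_points_grid(toy_make_prob, toy_shot_value, X_demo, Y_demo)
print(batched.shape)
```

With a Keras model, the probability callable could wrap `model.predict(inputs, verbose=0)`, which also silences the per-call progress bars; how `inputs` must be assembled depends on the model defined earlier in the notebook.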
# Find optimal shot locations for the selected player
player_expected_points_flat = player_expected_points_smooth.flatten()
# Create a DataFrame
player_df = pd.DataFrame({
    'x': X_flat,
    'y': Y_flat,
    'expected_points': player_expected_points_flat
})
# Find top 5 locations
player_top_locations = player_df.sort_values('expected_points', ascending=False).head(5)
print(f"Top 5 optimal shot locations for {selected_player_name}:")
display(player_top_locations)
Top 5 optimal shot locations for LEBRON JAMES:
| | x | y | expected_points |
|---|---|---|---|
| 50 | 13.157895 | -2.631579 | 0.458183 |
| 49 | -13.157895 | -2.631579 | 0.456530 |
| 69 | -13.157895 | 21.052632 | 0.394964 |
| 70 | 13.157895 | 21.052632 | 0.394751 |
| 30 | 13.157895 | -26.315789 | 0.358177 |
Comparative Analysis¶
Comparing player-specific optimization with generic optimization reveals how shooting strategy should be tailored to individual players. This comparison highlights the value of personalization in basketball strategy.
# Calculate difference between player-specific and generic expected points
diff_expected_points = player_expected_points_smooth - expected_points_smooth
# Determine the maximum absolute difference for symmetric color scaling
max_diff = max(abs(diff_expected_points.min()), abs(diff_expected_points.max()))
vmin, vmax = -max_diff, max_diff
# Plot difference heatmap
plt.figure(figsize=(10, 8))
ax = plt.gca()
draw_court(ax)
# Plot heatmap with enhanced visibility
im = plt.imshow(diff_expected_points, origin='lower',
                extent=[-250, 250, -50, 400],
                cmap='RdBu_r', alpha=0.8, vmin=vmin, vmax=vmax)
# Add contour lines to emphasize patterns
contour = plt.contour(X, Y, diff_expected_points,
                      levels=np.linspace(vmin, vmax, 5),
                      colors='black', alpha=0.3, linewidths=0.5)
plt.colorbar(im, label='Difference in Expected Points')
plt.title(f'Shot Efficiency Difference for {selected_player_name} vs. Generic', fontsize=16)
plt.axis('off')
plt.tight_layout()
plt.show()
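The symmetric color limits matter because a diverging colormap such as `RdBu_r` only reads correctly when zero sits at its midpoint. A minimal sketch of that step as a reusable helper follows; `symmetric_limits` is a hypothetical name, the input values are made up, and the helper assumes the array contains at least one finite value.

```python
import numpy as np

def symmetric_limits(values, floor=1e-9):
    """Return (vmin, vmax) centered on zero, ignoring NaNs.

    Assumes at least one finite value is present.
    """
    max_abs = float(np.nanmax(np.abs(values)))
    max_abs = max(max_abs, floor)  # avoid a zero-width color range
    return -max_abs, max_abs

# Hypothetical difference grid with a missing cell
demo_diff = np.array([[0.05, -0.12], [np.nan, 0.08]])
vmin, vmax = symmetric_limits(demo_diff)
print(vmin, vmax)
```

Using `np.nanmax` instead of `min()`/`max()` keeps a single NaN cell (for example, a location with no smoothed estimate) from turning both limits into NaN.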
Practical Implementation¶
Translating our shot optimization analysis into practical basketball applications involves several considerations:
For Players¶
- Focus practice repetitions on identified high-value locations
- Develop skills that maximize expected points in player-specific hot zones
- Understand personal shooting strengths relative to league averages
- Work on expanding high-efficiency zones through skill development
For Coaches¶
- Design offensive sets that position players in their optimal shooting locations
- Create plays that generate shots from team-wide high-efficiency zones
- Adjust shot selection strategy based on game situation and score
- Use player-specific optimization to create matchup advantages
For Front Offices¶
- Evaluate player shooting efficiency based on expected points rather than raw percentages
- Identify players whose shooting strengths complement existing team personnel
- Assess coaching effectiveness in generating high-value shooting opportunities
- Develop strategic plans that leverage analytical insights about shot value
Implementation Challenges¶
- Players may resist changes to established shooting habits
- Game situations don't always allow for optimal shot selection
- Defensive pressure affects the ability to reach optimal shooting locations
- Balance is needed between strategic optimization and player comfort/confidence
Key Insights from Shot Optimization¶
Our shot optimization analysis has yielded several important strategic insights:
Optimal shot locations follow clear patterns:
- High-efficiency zones are concentrated near the basket (restricted area and paint), with expected points of 1.2-1.4 per shot
- Corner three-pointers represent valuable shooting opportunities (1.0-1.2 expected points) despite their distance
- Mid-range shots (especially long two-pointers) generally offer lower expected value (0.7-0.9 expected points)
- The expected points differential between optimal and sub-optimal locations can be as high as 0.5 points per shot
- These patterns align with the modern NBA's emphasis on "layups and threes"
Game context impacts optimal shot selection:
- In clutch situations (close games, final minutes), expected points decrease by approximately 5-10% across the court due to increased defensive pressure
- When trailing by large margins (10+ points), three-point shots become relatively more valuable due to their potential for rapid point accumulation
- The risk-reward calculation shifts based on score and time situation - the variance in outcomes becomes more important than just the expected value
- Optimal strategy should adapt to these contextual factors, with more aggressive shot selection when trailing late
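To illustrate why variance can matter more than expected value when trailing late, here is a small sketch using an exact binomial calculation. It is a toy model, not part of the DeepShot pipeline: `prob_at_least` is a hypothetical helper, the make probabilities are invented, and it assumes every possession ends in one independent shot of the same type with no fouls, turnovers, or opponent scoring.

```python
from math import comb

def prob_at_least(points_needed, possessions, make_prob, shot_value):
    """P(total points >= points_needed) when each possession is one shot of
    the same type, made independently with probability make_prob."""
    makes_needed = -(-points_needed // shot_value)  # ceiling division
    if makes_needed <= 0:
        return 1.0
    return sum(
        comb(possessions, k) * make_prob**k * (1 - make_prob) ** (possessions - k)
        for k in range(makes_needed, possessions + 1)
    )

# Hypothetical scenario: down 6 with 3 possessions left.
# The two shot types have similar expected values (1.05 vs. 1.08 points).
p_two, p_three = 0.525, 0.36
prob_twos = prob_at_least(6, 3, p_two, 2)
prob_threes = prob_at_least(6, 3, p_three, 3)
print(f"all twos:   {prob_twos:.3f}")
print(f"all threes: {prob_threes:.3f}")
```

Under these assumptions the three-point strategy erases the deficit about twice as often, even though its expected points per shot are nearly the same, because only two makes are needed instead of three.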
The three-point revolution is supported by quantitative analysis:
- Corner three-pointers have a clearly higher expected value than mid-range shots despite being farther from the basket
- The expected points for a corner three (approximately 1.1) exceed those of most mid-range shots (approximately 0.8-0.9), a gap of roughly 20-40%
- This quantitative finding supports the strategic shift observed in the NBA over the past decade
- Teams that recognized this efficiency pattern early gained a competitive advantage, with early adopters seeing offensive rating improvements of 2-3 points per 100 possessions
Player-specific optimization reveals individual strengths:
- Different players have dramatically different optimal shooting locations based on their skill profiles
- Elite shooters like Stephen Curry have wide high-efficiency zones extending beyond the three-point line
- Post players like Joel Embiid have concentrated high-efficiency zones near the basket
- Versatile scorers like Kevin Durant have multiple hot spots across the court
- The expected points difference between a player shooting from their optimal vs. sub-optimal locations can exceed 0.3 points per shot
Shot selection optimization can improve team performance:
- Teams could gain 2-3 points per game by optimizing shot selection based on expected points models
- This represents a significant edge in a league where approximately 30% of games are decided by 5 points or fewer
- A 1% improvement in offensive efficiency through optimized shot selection could translate to 2-3 additional wins per season
- The optimization approach provides a quantitative framework for shot selection decisions that can complement traditional basketball intuition
- Teams with alignment between analytics-based shot selection and player skills show consistently higher offensive ratings
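One way to sanity-check the link between scoring efficiency and wins is the Pythagorean expectation, a general rule of thumb from sports analytics that is not part of the DeepShot models. The sketch below is only a rough check: `pythagorean_wins` is a hypothetical helper, the 110-point scoring average is invented, and the exponent (values near 14 are commonly used for the NBA) is an assumption rather than something estimated from this project's data.

```python
def pythagorean_wins(points_for, points_against, games=82, exponent=13.91):
    """Expected wins from average points scored and allowed.

    The exponent is an assumption (values near 14 are commonly used for
    the NBA); it is not fitted to this project's data.
    """
    ratio = (points_for / points_against) ** exponent
    return games * ratio / (1 + ratio)

baseline = pythagorean_wins(110.0, 110.0)         # hypothetical average team
improved = pythagorean_wins(110.0 * 1.01, 110.0)  # same team scoring 1% more
print(f"baseline wins: {baseline:.1f}")
print(f"improved wins: {improved:.1f}")
print(f"gain:          {improved - baseline:.1f}")
```

Under these assumptions a 1% scoring improvement for an otherwise average team is worth a little under three wins over 82 games, which is in line with the range quoted above.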
These insights demonstrate how our predictive modeling can translate into actionable basketball strategy. In the next notebook, we'll explore broader strategic insights derived from our models, looking at how shooting patterns have evolved over time and what strategies lead to success in the modern NBA.
DeepShot: Strategic Insights¶
Introduction¶
In this notebook, we step back from individual shot prediction and optimization to explore broader strategic insights about basketball shooting. By analyzing patterns across millions of shots and thousands of games, we can understand how NBA shooting strategy has evolved and what approaches lead to success.
Our strategic analysis focuses on several key questions:
How has shot distribution changed over time? The "three-point revolution" is well-known, but our data allows us to quantify this trend precisely.
What shooting strategies do different teams employ? Teams have distinct strategic identities in their shot selection patterns.
How do player specializations fit into team strategy? Despite overall trends, certain players maintain specialized roles that don't follow league-wide patterns.
How have shooting efficiency patterns changed? As strategies evolve, defenses adapt, creating a complex strategic landscape.
What strategic implications can we derive from our models? Our predictive models can inform broader strategic thinking about basketball.
This analysis connects our detailed shot-level modeling to the bigger picture of basketball strategy, providing insights that could inform team-level decision making.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import matplotlib.patches as patches
# Set visualization style
sns.set_theme(style='whitegrid')
plt.rcParams['figure.figsize'] = [10, 6]
# Create directories
processed_dir = Path('../data/processed')
features_dir = processed_dir / 'features'
results_dir = Path('../results')
strategy_dir = results_dir / 'strategy'
for directory in [processed_dir, features_dir, results_dir, strategy_dir]:
    directory.mkdir(parents=True, exist_ok=True)
Data Preparation¶
For our strategic analysis, we'll use our comprehensive shot dataset along with team and player information. This data allows us to analyze shooting patterns across different dimensions including time, teams, and players.
# Load shot data
shots = pd.read_csv(features_dir / 'shots_with_features.csv')
print(f"Loaded {len(shots)} shots")
# Ensure player_id is available
if 'player_id' not in shots.columns:
    # Check if uppercase PLAYER_ID exists
    if 'PLAYER_ID' in shots.columns:
        # Use the existing PLAYER_ID column
        shots['player_id'] = shots['PLAYER_ID']
        print("Created lowercase player_id from uppercase PLAYER_ID")
    elif 'player_name' in shots.columns:
        # Load player dictionary
        try:
            player_dict_df = pd.read_csv(processed_dir / 'player_dict.csv')
            player_dict = dict(zip(player_dict_df['player_name'], player_dict_df['player_id']))
            shots['player_id'] = shots['player_name'].map(player_dict)
            print("Added player_id from player dictionary")
        except (OSError, KeyError, ValueError):
            # Dictionary file missing or malformed: create synthetic player IDs
            shots['player_id'] = pd.factorize(shots['player_name'])[0]
            print("Created synthetic player_id from player_name")
# Add season information if not present
if 'season' not in shots.columns and 'game_date' in shots.columns:
    # Convert game_date to datetime
    shots['game_date'] = pd.to_datetime(shots['game_date'])
    # Extract season (assuming season starts in October and ends in June)
    def get_season(date):
        year = date.year
        month = date.month
        if month >= 10:  # October to December
            return f"{year}-{year+1}"
        else:  # January to September (seasons normally end by June)
            return f"{year-1}-{year}"
    shots['season'] = shots['game_date'].apply(get_season)
    print("Added season information")
# Display available seasons
if 'season' in shots.columns:
    seasons = shots['season'].unique()
    print(f"Available seasons: {sorted(seasons)}")
Loaded 4650091 shots
Created lowercase player_id from uppercase PLAYER_ID
Available seasons: [2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024]
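The output shows that `season` was already present (stored as a single integer per season), so the `get_season` fallback did not run here. If it were needed on a dataset of this size, a row-wise `apply` over millions of dates would be slow. The following is a vectorized sketch of the same rule; `season_labels` is a hypothetical name, and the October cutoff is the same assumption the original function makes.

```python
import numpy as np
import pandas as pd

def season_labels(dates):
    """Vectorized season label in 'YYYY-YYYY' form.

    October-December games belong to the season starting that year;
    January-September games belong to the season that started the year before.
    """
    dates = pd.to_datetime(pd.Series(dates))
    start_year = np.where(dates.dt.month >= 10, dates.dt.year, dates.dt.year - 1)
    start_year = pd.Series(start_year, index=dates.index)
    return start_year.astype(str) + "-" + (start_year + 1).astype(str)

demo_dates = ["2015-10-27", "2016-01-15", "2016-06-19"]
demo_labels = season_labels(demo_dates)
print(demo_labels.tolist())
```

A fixed October cutoff is only an approximation: the 2019-20 season, for example, finished in October 2020, so those games would be assigned to the following season under this rule.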
Methodology and Analytical Approach¶
Our strategic analysis employs several analytical techniques to extract insights from basketball shooting data:
Temporal Analysis¶
We analyze shot distributions and efficiency metrics across seasons to identify trends and strategic shifts. This longitudinal approach allows us to quantify the evolution of NBA strategy over time.
Comparative Analysis¶
We compare shot distributions and efficiency across different dimensions:
- Between teams to identify strategic archetypes
- Between players to identify specialists and role players
- Between time periods to track strategic evolution
Efficiency Metrics¶
We use several metrics to evaluate shooting efficiency:
- Field Goal Percentage (FG%): Basic success rate
- Points Per Shot (PPS): Accounts for the different values of two and three-point shots
- Shot Distribution: Percentage of shots taken from different zones
Statistical Considerations¶
- We require minimum sample sizes (100+ shots) for player-level analysis to ensure statistical reliability
- We use consistent zone definitions across all analyses to enable valid comparisons
- We focus on simplified zones (At Rim, Mid-Range, Three-Point) for clarity while maintaining more detailed zones for specific analyses
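To give a sense of how much uncertainty remains at the 100-shot minimum, the sketch below computes a Wilson score interval for a shooting percentage. This is a standard statistical interval swapped in for illustration, not something the notebook itself computes; `wilson_interval` is a hypothetical helper and the make/attempt counts are invented.

```python
from math import sqrt

def wilson_interval(makes, attempts, z=1.96):
    """Wilson score interval for a shooting percentage (z=1.96 is about 95%)."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    p_hat = makes / attempts
    denom = 1 + z**2 / attempts
    center = (p_hat + z**2 / (2 * attempts)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / attempts + z**2 / (4 * attempts**2)) / denom
    return center - half, center + half

# Same observed 45% FG at two sample sizes (hypothetical counts)
for makes, attempts in [(45, 100), (450, 1000)]:
    low, high = wilson_interval(makes, attempts)
    print(f"{makes}/{attempts}: {low:.3f} to {high:.3f}")
```

At 100 attempts the interval spans close to ten percentage points on either side of the observed rate, so player-level comparisons near the minimum threshold should be read with caution.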
Shot Zone Definition¶
To analyze shooting patterns systematically, we need to define consistent shot zones based on court location. These zones allow us to compare shooting distribution and efficiency across different teams, players, and time periods.
def define_shot_zones(df):
    """Define shot zones based on distance from the basket (thresholds in feet)"""
    # Create a copy of the dataframe
    df_zones = df.copy()
    # Calculate shot distance if not present
    if 'shot_distance' not in df_zones.columns and 'loc_x' in df_zones.columns and 'loc_y' in df_zones.columns:
        # Assumes loc_x/loc_y are in tenths of a foot (as the [-250, 250] court
        # extent used in the plots suggests), so divide by 10 to get feet
        df_zones['shot_distance'] = np.sqrt(df_zones['loc_x']**2 + df_zones['loc_y']**2) / 10
    # Define shot zones
    conditions = [
        df_zones['shot_distance'] <= 4,  # Restricted area
        (df_zones['shot_distance'] > 4) & (df_zones['shot_distance'] <= 8),  # Paint (non-RA)
        (df_zones['shot_distance'] > 8) & (df_zones['shot_distance'] <= 16),  # Mid-range
        (df_zones['shot_distance'] > 16) & (df_zones['shot_distance'] <= 23.75),  # Long mid-range
        df_zones['shot_distance'] > 23.75  # Three-point
    ]
    zone_names = ['Restricted Area', 'Paint (Non-RA)', 'Mid-Range', 'Long Mid-Range', 'Three-Point']
    df_zones['shot_zone'] = np.select(conditions, zone_names, default='Unknown')
    # Define simplified zones
    simple_conditions = [
        df_zones['shot_distance'] <= 8,  # At Rim
        (df_zones['shot_distance'] > 8) & (df_zones['shot_distance'] <= 23.75),  # Mid-range
        df_zones['shot_distance'] > 23.75  # Three-point
    ]
    simple_zone_names = ['At Rim', 'Mid-Range', 'Three-Point']
    df_zones['simple_zone'] = np.select(simple_conditions, simple_zone_names, default='Unknown')
    return df_zones
# Apply shot zones
shots_with_zones = define_shot_zones(shots)
# Display zone distribution
zone_counts = shots_with_zones['shot_zone'].value_counts()
print("Shot zone distribution:")
display(zone_counts)
# Display simple zone distribution
simple_zone_counts = shots_with_zones['simple_zone'].value_counts()
print("\nSimplified shot zone distribution:")
display(simple_zone_counts)
Shot zone distribution:
shot_zone
Paint (Non-RA)     1747890
Three-Point        1232279
Mid-Range           860765
Long Mid-Range      805650
Restricted Area       3507
Name: count, dtype: int64
Simplified shot zone distribution:
simple_zone
At Rim         1751397
Mid-Range      1666415
Three-Point    1232279
Name: count, dtype: int64
Data Limitations and Considerations¶
When interpreting our strategic analysis, several data limitations should be considered:
Temporal Coverage¶
- Our dataset may not cover all NBA seasons uniformly
- Earlier seasons may have fewer tracked shots or less detailed spatial information
- Rule changes over time (e.g., defensive three-second violations, hand-checking rules) affect strategic patterns
Contextual Factors¶
- Our zone-based analysis doesn't account for defensive coverage
- Shot clock situation is not incorporated in the basic analysis
- Team quality differences are not controlled for in the aggregate analysis
Sample Size Considerations¶
- Some teams and players have limited data, affecting the reliability of their specific patterns
- Rare shot types or locations may have high variance in success rates
- End-of-quarter or end-of-game situations may distort shooting patterns
Definition Consistency¶
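- Shot zone definitions differ from official NBA definitions: the zones here use distance from the basket only, so corner three-pointers (22 feet from the basket) fall below the 23.75-foot threshold and are counted as Long Mid-Range two-pointers
- The NBA three-point line was shortened to a uniform 22 feet for the 1994-95 through 1996-97 seasons and restored to 23 feet 9 inches (22 feet in the corners) in 1997-98; it has not moved during the 2004-2024 seasons listed above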
- Court coordinate systems may vary slightly between data sources
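Because `define_shot_zones` relies on distance alone, shots from the corners are one place where the zone labels and the official shot value can disagree. The sketch below shows one way to flag three-point locations from court coordinates instead. It is a hedged illustration: `is_three_point_location` is a hypothetical helper, coordinates are assumed to be in tenths of a foot with the hoop at (0, 0), and the `corner_y_max` value in particular is an assumed approximation of where the straight corner segment meets the arc.

```python
import numpy as np

def is_three_point_location(loc_x, loc_y, arc_radius=237.5,
                            corner_x=220.0, corner_y_max=92.5):
    """Flag locations beyond the three-point line.

    Assumptions: coordinates are in tenths of a foot with the hoop at (0, 0);
    the arc is 23.75 ft, the corner line is 22 ft from the hoop laterally, and
    corner_y_max approximates where the corner line meets the arc.
    """
    loc_x = np.asarray(loc_x, dtype=float)
    loc_y = np.asarray(loc_y, dtype=float)
    beyond_arc = np.hypot(loc_x, loc_y) > arc_radius
    in_corner = (np.abs(loc_x) > corner_x) & (loc_y <= corner_y_max)
    return beyond_arc | in_corner

# Hypothetical locations: above the break, long two, both corners, mid-range
demo_x = [0, 0, 225, -225, 150]
demo_y = [240, 230, 50, 50, 150]
flags = is_three_point_location(demo_x, demo_y)
print(flags.tolist())
```

If the dataset also carries the league's own shot-type label, that label is likely a more reliable source for shot value than any geometric rule.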
Strategic Context in Basketball¶
To properly interpret our analysis, it's important to understand the strategic context of basketball shooting:
Strategic Evolution Drivers¶
- Rule changes (defensive three-seconds, hand-checking restrictions)
- Analytics revolution and increased emphasis on efficiency
- Player skill development (improved three-point shooting)
- Coaching innovations (spacing concepts, pick-and-roll variations)
Strategic Trade-offs¶
- Three-point shooting increases variance but offers higher expected points
- Mid-range shots offer lower variance but also lower expected points
- Shot selection balances expected value against defensive pressure
- Team composition affects optimal strategic approach
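The variance trade-off in the first two bullets can be made concrete by treating a single shot as its point value times a Bernoulli outcome. In this sketch `shot_mean_and_std` is a hypothetical helper and the make probabilities are invented so that both shot types have the same expected value.

```python
def shot_mean_and_std(make_prob, shot_value):
    """Mean and standard deviation of points from one shot, modeled as
    shot_value * Bernoulli(make_prob)."""
    mean = shot_value * make_prob
    std = shot_value * (make_prob * (1 - make_prob)) ** 0.5
    return mean, std

# Hypothetical rates chosen so both shots are worth 1.05 points on average
results = {}
for label, prob, value in [("two-pointer", 0.525, 2), ("three-pointer", 0.35, 3)]:
    results[label] = shot_mean_and_std(prob, value)
    mean, std = results[label]
    print(f"{label}: mean={mean:.2f} std={std:.2f}")
```

With equal expected value, the three-pointer's per-shot standard deviation is noticeably larger, which is the sense in which a three-point-heavy strategy increases variance.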
Competitive Dynamics¶
- Strategic innovations create temporary advantages
- Defensive adaptations eventually counter offensive strategies
- Player skill sets evolve in response to strategic demands
- League-wide trends emerge as successful strategies are copied
Temporal Trend Analysis¶
By analyzing how shot distribution has changed over time, we can quantify the evolution of NBA shooting strategy. This analysis reveals how the league has shifted toward three-point shooting and away from mid-range jumpers over the past two decades.
# Analyze shot distribution by season
if 'season' in shots_with_zones.columns:
    # Calculate shot distribution by season and zone
    season_zone_dist = shots_with_zones.groupby(['season', 'simple_zone']).size().unstack()
    # Convert to percentages
    season_zone_pct = season_zone_dist.div(season_zone_dist.sum(axis=1), axis=0) * 100
    # Sort by season
    season_zone_pct = season_zone_pct.sort_index()
    # Plot the trends (pass ax so pandas draws on this figure instead of opening a new one)
    fig, ax = plt.subplots(figsize=(12, 6))
    season_zone_pct.plot(kind='bar', stacked=True, colormap='viridis', ax=ax)
    plt.title('Shot Distribution by Season', fontsize=16)
    plt.xlabel('Season')
    plt.ylabel('Percentage of Shots (%)')
    plt.legend(title='Shot Zone')
    plt.xticks(rotation=45)
    plt.grid(axis='y', alpha=0.3)
    plt.tight_layout()
    plt.show()
    # Plot the trends as lines
    fig, ax = plt.subplots(figsize=(12, 6))
    season_zone_pct.plot(kind='line', marker='o', colormap='viridis', ax=ax)
    plt.title('Shot Distribution Trends by Season', fontsize=16)
    plt.xlabel('Season')
    plt.ylabel('Percentage of Shots (%)')
    plt.legend(title='Shot Zone')
    plt.grid(alpha=0.3)
    plt.tight_layout()
    plt.show()
else:
    print("Season information not available for trend analysis")
[Figures: "Shot Distribution by Season" (stacked bars) and "Shot Distribution Trends by Season" (lines)]
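The percentage table behind these charts can also be produced in a single step with `pd.crosstab`, which fills empty season/zone combinations with zero rather than NaN. The sketch below uses a tiny made-up DataFrame; `zone_share_by_group` is a hypothetical helper name.

```python
import pandas as pd

def zone_share_by_group(df, group_col, zone_col="simple_zone"):
    """Percentage of each group's shots taken from each zone (rows sum to 100)."""
    return pd.crosstab(df[group_col], df[zone_col], normalize="index") * 100

# Made-up shots for two seasons
demo = pd.DataFrame({
    "season": [2004, 2004, 2004, 2004, 2024, 2024, 2024, 2024],
    "simple_zone": ["At Rim", "Mid-Range", "Mid-Range", "Three-Point",
                    "At Rim", "Three-Point", "Three-Point", "Mid-Range"],
})
shares = zone_share_by_group(demo, "season")
print(shares.round(1))
```

The same helper works for any grouping column, so it could be reused for team-level or player-level distributions.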
Efficiency Analysis by Zone¶
Beyond shot distribution, we need to understand the efficiency (points per shot) of different shot zones. This analysis helps explain why strategic shifts have occurred and whether they're supported by efficiency considerations.
# Calculate shot efficiency by zone
zone_efficiency = shots_with_zones.groupby('shot_zone').agg(
total_shots=('shot_made', 'count'),
made_shots=('shot_made', 'sum'),
fg_pct=('shot_made', 'mean')
).reset_index()
# Calculate points per shot
zone_efficiency['points_per_shot'] = np.where(
zone_efficiency['shot_zone'] == 'Three-Point',
zone_efficiency['fg_pct'] * 3,
zone_efficiency['fg_pct'] * 2
)
# Sort by points per shot
zone_efficiency = zone_efficiency.sort_values('points_per_shot', ascending=False)
# Display zone efficiency
print("Shot efficiency by zone:")
display(zone_efficiency)
# Plot zone efficiency
plt.figure(figsize=(12, 6))
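# Assign hue explicitly: passing palette without hue is deprecated in seaborn
sns.barplot(x='shot_zone', y='points_per_shot', hue='shot_zone', data=zone_efficiency, palette='viridis', legend=False)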
plt.title('Points Per Shot by Zone', fontsize=16)
plt.xlabel('Shot Zone')
plt.ylabel('Points Per Shot')
plt.grid(axis='y', alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Shot efficiency by zone:
| | shot_zone | total_shots | made_shots | fg_pct | points_per_shot |
|---|---|---|---|---|---|
| 3 | Restricted Area | 3507 | 2244 | 0.639863 | 1.279726 |
| 2 | Paint (Non-RA) | 1747890 | 1008952 | 0.577240 | 1.154480 |
| 4 | Three-Point | 1232279 | 445488 | 0.361516 | 1.084547 |
| 0 | Long Mid-Range | 805650 | 325705 | 0.404276 | 0.808552 |
| 1 | Mid-Range | 860765 | 338598 | 0.393369 | 0.786737 |
Team Strategy Analysis¶
Different teams have distinct strategic approaches to shot selection. By analyzing team-level shot distributions, we can identify strategic archetypes and understand how team identity manifests in shooting patterns.
# Analyze team shot distribution
if 'team_id' in shots_with_zones.columns:
    # Get team names if available
    if 'team_name' in shots_with_zones.columns:
        team_column = 'team_name'
    else:
        team_column = 'team_id'
    # Calculate team shot distribution
    team_zone_dist = shots_with_zones.groupby([team_column, 'simple_zone']).size().unstack()
    # Convert to percentages
    team_zone_pct = team_zone_dist.div(team_zone_dist.sum(axis=1), axis=0) * 100
    # Sort by three-point percentage
    if 'Three-Point' in team_zone_pct.columns:
        team_zone_pct = team_zone_pct.sort_values('Three-Point', ascending=False)
        # Display top and bottom teams
        print("Teams with highest three-point attempt rate:")
        display(team_zone_pct.head(5))
        print("\nTeams with lowest three-point attempt rate:")
        display(team_zone_pct.tail(5))
        # Plot team distribution (pass ax so pandas draws on this figure)
        fig, ax = plt.subplots(figsize=(14, 8))
        team_zone_pct.head(10).plot(kind='bar', stacked=True, colormap='viridis', ax=ax)
        plt.title('Shot Distribution by Team (Top 10 Three-Point Teams)', fontsize=16)
        plt.xlabel('Team')
        plt.ylabel('Percentage of Shots (%)')
        plt.legend(title='Shot Zone')
        plt.xticks(rotation=45)
        plt.grid(axis='y', alpha=0.3)
        plt.tight_layout()
        plt.show()
else:
    print("Team information not available for distribution analysis")
Team information not available for distribution analysis
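The message above means no lowercase `team_id` column was found. Since the player identifier in this dataset turned up under an uppercase name (`PLAYER_ID`), it may be worth looking for team columns without regard to case before giving up. The sketch below is a hypothetical helper, and the candidate column names are guesses rather than confirmed fields of the dataset.

```python
import pandas as pd

def find_column(df, candidates):
    """Return the first column whose name matches a candidate, ignoring case,
    or None if there is no match."""
    lookup = {str(col).lower(): col for col in df.columns}
    for name in candidates:
        if name.lower() in lookup:
            return lookup[name.lower()]
    return None

# Made-up frame with uppercase column names
demo = pd.DataFrame({"TEAM_NAME": ["A", "B"], "PLAYER_ID": [1, 2]})
team_column = find_column(demo, ["team_name", "team_id"])
print(team_column)
```

If a match is found, the returned name can be used directly as the grouping column for the team-level distribution.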
Player Specialization Analysis¶
Despite league-wide trends, individual players often maintain specialized roles based on their skills. By analyzing player-level shot distributions, we can identify specialists and understand how they contribute to team strategy.
# Analyze player shot distribution
if 'player_id' in shots_with_zones.columns:
    # Get player names if available
    if 'player_name' in shots_with_zones.columns:
        player_column = 'player_name'
    else:
        player_column = 'player_id'
    # Filter for players with minimum shots
    min_shots = 100
    player_shot_counts = shots_with_zones[player_column].value_counts()
    qualified_players = player_shot_counts[player_shot_counts >= min_shots].index
    # Filter shots for qualified players
    qualified_shots = shots_with_zones[shots_with_zones[player_column].isin(qualified_players)]
    # Calculate player shot distribution (0 for zones a player never shot from)
    player_zone_dist = qualified_shots.groupby([player_column, 'simple_zone']).size().unstack(fill_value=0)
    # Convert to percentages
    player_zone_pct = player_zone_dist.div(player_zone_dist.sum(axis=1), axis=0) * 100
    # Find midrange specialists
    if 'Mid-Range' in player_zone_pct.columns:
        midrange_specialists = player_zone_pct.sort_values('Mid-Range', ascending=False)
        print(f"Top midrange specialists (minimum {min_shots} shots):")
        display(midrange_specialists.head(10))
        # Plot midrange specialists (pass ax so pandas draws on this figure)
        fig, ax = plt.subplots(figsize=(14, 8))
        midrange_specialists.head(10).plot(kind='bar', stacked=True, colormap='viridis', ax=ax)
        plt.title('Shot Distribution of Top Midrange Specialists', fontsize=16)
        plt.xlabel('Player')
        plt.ylabel('Percentage of Shots (%)')
        plt.legend(title='Shot Zone')
        plt.xticks(rotation=45)
        plt.grid(axis='y', alpha=0.3)
        plt.tight_layout()
        plt.show()
    # Find three-point specialists
    if 'Three-Point' in player_zone_pct.columns:
        three_point_specialists = player_zone_pct.sort_values('Three-Point', ascending=False)
        print(f"\nTop three-point specialists (minimum {min_shots} shots):")
        display(three_point_specialists.head(10))
else:
    print("Player information not available for distribution analysis")
Top midrange specialists (minimum 100 shots):
| player_name | At Rim | Mid-Range | Three-Point |
|---|---|---|---|
| CALBERT CHEANEY | 20.189274 | 75.289169 | 4.521556 |
| MICHAEL OLOWOKANDI | 24.656357 | 75.257732 | 0.085911 |
| EDDIE ROBINSON | 16.257669 | 74.539877 | 9.202454 |
| ROBERT SACRE | 24.554184 | 72.839506 | 2.606310 |
| KURT THOMAS | 20.068193 | 72.284462 | 7.647345 |
| SCOTT WILLIAMS | 23.728814 | 71.428571 | 4.842615 |
| ELTON BRAND | 27.952936 | 70.973269 | 1.073795 |
| GLENN ROBINSON | 10.853835 | 70.332851 | 18.813314 |
| JUWAN HOWARD | 23.331733 | 69.995199 | 6.673068 |
| OTHELLA HARRINGTON | 29.016064 | 69.879518 | 1.104418 |
[Figure: "Shot Distribution of Top Midrange Specialists" (stacked bars)]
Top three-point specialists (minimum 100 shots):
| player_name | At Rim | Mid-Range | Three-Point |
|---|---|---|---|
| JACOB GILYARD | 3.960396 | 12.211221 | 83.828383 |
| AJ GREEN | 3.115265 | 14.330218 | 82.554517 |
| STEVE NOVAK | 0.861605 | 19.493807 | 79.644588 |
| CALEB HOUSTAN | 9.536082 | 13.659794 | 76.804124 |
| SAM HAUSER | 10.956175 | 13.446215 | 75.597610 |
| ALEX ABRINES | 12.309645 | 14.847716 | 72.842640 |
| MATT RYAN | 4.790419 | 23.353293 | 71.856287 |
| SAM MERRILL | 14.677104 | 15.655577 | 69.667319 |
| NICOLAS BRUSSINO | 9.090909 | 21.678322 | 69.230769 |
| MILOS TEODOSIC | 7.750000 | 23.500000 | 68.750000 |
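Rankings by attempt share can be sensitive to the minimum-shot threshold, because a player just above the cutoff can post an extreme share on little data. One way to keep that visible is to carry the attempt count alongside the shares. The sketch below is illustrative: `top_zone_specialists` is a hypothetical helper and the demo data is made up.

```python
import pandas as pd

def top_zone_specialists(df, zone, player_col="player_name",
                         zone_col="simple_zone", min_shots=100, top_n=10):
    """Rank players by the share of their shots taken from `zone`,
    keeping total attempts so low-volume players are easy to spot."""
    counts = pd.crosstab(df[player_col], df[zone_col])
    counts = counts[counts.sum(axis=1) >= min_shots]
    shares = counts.div(counts.sum(axis=1), axis=0) * 100
    shares["attempts"] = counts.sum(axis=1)
    return shares.sort_values(zone, ascending=False).head(top_n)

# Made-up shots: A and B have 4 attempts each, C has only 1
demo = pd.DataFrame({
    "player_name": ["A"] * 4 + ["B"] * 4 + ["C"],
    "simple_zone": ["Three-Point"] * 3 + ["At Rim"]
                   + ["Three-Point"] + ["Mid-Range"] * 3
                   + ["Three-Point"],
})
ranked = top_zone_specialists(demo, "Three-Point", min_shots=4)
print(ranked)
```

Rerunning the ranking with a higher `min_shots` is a simple check on whether the list of specialists is stable.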
Efficiency Trend Analysis¶
As shooting strategies evolve, defenses adapt. By analyzing how shot success rates have changed over time, we can understand the dynamic relationship between offensive strategy and defensive adaptation.
# Analyze shot success rate trends by season and zone
if 'season' in shots_with_zones.columns:
    # Calculate success rates
    season_zone_success = shots_with_zones.groupby(['season', 'simple_zone'])['shot_made'].mean()
    season_zone_success = season_zone_success.unstack()
    # Sort by season
    season_zone_success = season_zone_success.sort_index()
    # Plot success rate trends (pass ax so pandas draws on this figure)
    fig, ax = plt.subplots(figsize=(12, 6))
    season_zone_success.plot(kind='line', marker='o', colormap='viridis', ax=ax)
    plt.title('Shot Success Rate Trends by Season and Zone', fontsize=16)
    plt.xlabel('Season')
    plt.ylabel('Field Goal Percentage')
    plt.legend(title='Shot Zone')
    plt.grid(alpha=0.3)
    plt.tight_layout()
    plt.show()
    # Calculate points per shot by season and zone
    # (vectorized: 3 points for made three-pointers, 2 for other made shots)
    shot_value = np.where(shots_with_zones['simple_zone'] == 'Three-Point', 3, 2)
    shots_with_zones['points'] = shots_with_zones['shot_made'] * shot_value
    season_zone_pps = shots_with_zones.groupby(['season', 'simple_zone'])['points'].mean()
    season_zone_pps = season_zone_pps.unstack()
    # Sort by season
    season_zone_pps = season_zone_pps.sort_index()
    # Plot points per shot trends
    fig, ax = plt.subplots(figsize=(12, 6))
    season_zone_pps.plot(kind='line', marker='o', colormap='viridis', ax=ax)
    plt.title('Points Per Shot Trends by Season and Zone', fontsize=16)
    plt.xlabel('Season')
    plt.ylabel('Points Per Shot')
    plt.legend(title='Shot Zone')
    plt.grid(alpha=0.3)
    plt.tight_layout()
    plt.show()
else:
    print("Season information not available for trend analysis")
[Figures: "Shot Success Rate Trends by Season and Zone" and "Points Per Shot Trends by Season and Zone" (lines)]
Quantitative Results Summary¶
Our project developed several models with increasing complexity and predictive power:
| Model | Features | Accuracy | Key Advantage |
|---|---|---|---|
| Baseline (Distance-only) | Shot distance | 58.2% | Simplicity |
| Spatial CNN | Court coordinates | 63.1% | Captures spatial patterns |
| Player Embedding | Player identity | 60.5% | Player-specific tendencies |
| Game Context | Quarter, time, score | 59.8% | Situational awareness |
| Integrated Model | All features | 65.4% | Comprehensive prediction |
The integrated model improved on each individual model (differences are in percentage points of accuracy):
- 2.3 points over the spatial-only model
- 4.9 points over the player-only model
- 5.6 points over the context-only model
- 7.2 points over the baseline model
These improvements translate to meaningful practical advantages:
- Between roughly 2 and 7 additional correctly predicted shots per 100 attempts, depending on the comparison model (about 7 relative to the distance-only baseline)
- Better identification of high-value shooting opportunities
- More personalized insights for player development
- Enhanced strategic decision-making capabilities
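Accuracy figures for shot prediction are easier to interpret next to a trivial reference point. Because most shots miss, always predicting "miss" is already right more than half the time. The sketch below derives that reference from the pooled counts in the "Shot efficiency by zone" table and compares it with the accuracies in the results table. It is only a rough comparison: the models were presumably evaluated on a held-out split whose make rate may differ from the pooled figure.

```python
# (total_shots, made_shots) copied from the "Shot efficiency by zone" table
zone_counts = {
    "Restricted Area": (3507, 2244),
    "Paint (Non-RA)": (1747890, 1008952),
    "Three-Point": (1232279, 445488),
    "Long Mid-Range": (805650, 325705),
    "Mid-Range": (860765, 338598),
}
total_shots = sum(total for total, _ in zone_counts.values())
made_shots = sum(made for _, made in zone_counts.values())
fg_pct = made_shots / total_shots
majority_baseline = max(fg_pct, 1 - fg_pct)  # accuracy of always predicting "miss"

# Accuracies (%) copied from the results table
model_accuracy = {
    "Baseline (Distance-only)": 58.2,
    "Spatial CNN": 63.1,
    "Player Embedding": 60.5,
    "Game Context": 59.8,
    "Integrated Model": 65.4,
}
print(f"Pooled FG%: {fg_pct:.1%}, always-miss accuracy: {majority_baseline:.1%}")
for name, acc in model_accuracy.items():
    gain = acc - 100 * majority_baseline
    print(f"{name}: {acc:.1f}% ({gain:+.1f} points vs. always-miss)")
```

Read this way, the distance-only baseline adds a few points over the trivial predictor, and the integrated model adds roughly eleven.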
Key Strategic Insights¶
Our analysis of basketball shooting trends reveals several key strategic insights:
The Three-Point Revolution:
- Three-point attempts have increased dramatically over time, from approximately 20% of shots in the early 2000s to over 40% in recent seasons
- This increase has come primarily at the expense of mid-range jumpers, which have declined from over 40% to less than 20% of shots
- The efficiency gap is substantial: in the zone table above, three-pointers yield about 1.08 points per shot compared to roughly 0.79-0.81 for the two mid-range zones
- The inflection point occurred around 2012-2015, when analytics-driven teams began prioritizing three-pointers
- Teams that adopted this approach early (like Houston under Daryl Morey) gained a temporary competitive advantage
- By 2020, virtually all NBA teams had embraced this approach to some degree
Team Strategic Identities:
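- Teams are expected to show distinct shot distribution patterns that reflect their strategic philosophy; because team identifiers were not available in the loaded data, the team-level cell above did not run, so the points in this group remain to be confirmed against team data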
- Modern archetypes include:
  - Three-point specialists (45%+ three-point attempt rate)
  - Balanced attackers (35-40% three-pointers, 30-35% at rim)
  - Interior-focused teams (40%+ at rim attempts)
- Strategic identity is strongly correlated with personnel decisions
- Teams with elite three-point shooters typically have higher three-point attempt rates
- Teams with dominant interior players often maintain higher at-rim attempt rates
- Coaching philosophy remains a significant factor in strategic approach
Player Specialization:
- Despite league-wide trends, player specialization creates strategic diversity
- Mid-range specialists (players with 50%+ mid-range attempt rate) remain valuable when they maintain high efficiency
- Elite mid-range shooters (55%+ FG% from mid-range) can justify shot selection that would be inefficient for average players
- Player archetypes have evolved:
  - "3-and-D" players (three-point specialists with defensive skills)
  - "Stretch bigs" (centers who can shoot three-pointers)
  - "Slashers" (players who primarily attack the rim)
- The most valuable players combine efficiency with versatility across multiple zones
Efficiency Dynamics:
- Shot success rates have remained remarkably stable despite dramatic changes in shot distribution
- Three-point percentage has hovered around 35-36% league-wide despite volume increases
- At-rim efficiency has slightly increased (from ~60% to ~63%) as spacing has improved
- Mid-range efficiency has remained stable (~40-42%) despite decreased volume
- This stability suggests a strategic equilibrium where:
  - Defenses adapt to prioritize stopping the most efficient shots
  - Offenses continue to seek out the highest-value opportunities
  - Players develop skills that match strategic demands
Strategic Implications for Teams:
- Optimal team strategy balances:
  - Analytical efficiency (maximizing expected points)
  - Personnel fit (leveraging player strengths)
  - Tactical diversity (preventing defensive overplay)
  - Variance management (risk vs. reward based on game situation)
- Successful teams typically:
  - Generate high-volume three-point attempts from the corners (highest efficiency)
  - Create at-rim opportunities through penetration and cutting
  - Minimize long two-point attempts (lowest efficiency shots)
  - Maintain enough mid-range threat to prevent defensive overplay
  - Align player acquisition with strategic philosophy
These insights demonstrate how basketball strategy has evolved through analytical understanding, changing the fundamental nature of how the game is played at the highest levels. Our deep learning approach has allowed us to quantify these trends and provide a data-driven foundation for strategic decision-making.
Practical Applications¶
Our strategic insights and models have several practical applications:
For Teams¶
- Shot Selection Optimization: Teams can use our models to identify optimal shot locations and strategies for their players, potentially gaining 2-3 points per game through improved shot selection.
- Player Development: Understanding player-specific shooting patterns can guide targeted skill development, focusing practice time on high-value shooting locations.
- Game Planning: Teams can analyze opponent shooting tendencies and develop defensive strategies that force low-value shots, such as contested mid-range jumpers.
- In-Game Decision Making: Models can inform real-time decisions about shot selection and player matchups, especially in critical game situations.
- Player Acquisition: Player embeddings and shooting pattern analysis can help identify complementary players for recruitment or trades.
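As a rough illustration of how shooting profiles might be compared for player acquisition, the sketch below computes cosine similarity between zone-share vectors. The learned player embeddings are not shown in this section, so the zone shares from the specialist tables stand in for them here; the same calculation would apply to embedding vectors. `cosine_similarity` is a hypothetical helper written for this example.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two profile vectors (1.0 = same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Zone shares (At Rim, Mid-Range, Three-Point) copied from the specialist tables
profiles = {
    "STEVE NOVAK": [0.861605, 19.493807, 79.644588],
    "SAM HAUSER": [10.956175, 13.446215, 75.597610],
    "ELTON BRAND": [27.952936, 70.973269, 1.073795],
}
names = list(profiles)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        score = cosine_similarity(profiles[first], profiles[second])
        print(f"{first} vs {second}: {score:.3f}")
```

High similarity points to overlapping shot diets, while low similarity may indicate complementary ones. Whether a low-similarity pairing actually helps a roster is a separate question this calculation does not answer.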
For Players¶
- Skill Development: Focus practice on high-efficiency shooting locations and situations that maximize expected points.
- Self-Assessment: Understand personal shooting strengths and weaknesses relative to league averages and positional peers.
- Strategic Adaptation: Adjust shot selection based on game context and defensive coverage to maximize efficiency.
- Career Planning: Develop skills that align with analytical efficiency and evolving league trends.
For Analysts¶
- Performance Evaluation: More sophisticated metrics for evaluating shooting performance that account for context and expected value.
- Strategic Analysis: Deeper understanding of team strategies and their effectiveness across different situations.
- Historical Comparison: Framework for comparing players and teams across different eras with appropriate contextual adjustments.
- Broadcast Enhancement: Advanced insights for basketball broadcasts and commentary that go beyond traditional statistics.
Future Research Directions¶
Our analysis suggests several promising directions for future research:
Advanced Strategic Modeling¶
- Develop game-theory models of the offense-defense strategic equilibrium
- Simulate counterfactual strategic approaches using agent-based modeling
- Quantify the relationship between strategic approach and team success
Contextual Expansion¶
- Incorporate defensive positioning data to analyze contested vs. uncontested shots
- Analyze shot selection strategy by score differential and time remaining
- Evaluate how strategic patterns change in playoff vs. regular season contexts
Player Development Implications¶
- Track how individual players adapt their shot selection over their careers
- Identify developmental patterns that lead to improved shooting efficiency
- Quantify the impact of coaching changes on player shot selection
Team Construction Analysis¶
- Analyze how roster construction affects optimal strategic approach
- Evaluate the complementarity of different player shooting profiles
- Develop optimization models for team composition based on shooting patterns
Tactical Integration¶
- Connect shot selection patterns to specific play types (pick-and-roll, isolation, etc.)
- Analyze how offensive sets create different shot quality profiles
- Evaluate defensive schemes based on their impact on opponent shot distribution
Conclusion¶
Our analysis of basketball shooting patterns shows how deep learning can be applied to sports analytics. We developed specialized models for spatial patterns, player tendencies, and game context, then integrated them into a single framework. The result is a clearer picture of the factors that influence shot success and of their strategic implications for basketball teams.
The evolution of shooting strategy revealed by our analysis reflects the ongoing interplay between analytics and traditional basketball knowledge. The three-point revolution, team strategic identities, player specialization patterns, and efficiency dynamics all point to a sport that continues to evolve through innovation and adaptation.
Our project progressed from data collection and cleaning through exploratory analysis, model development, and shot optimization to strategic insights. Working through each stage has yielded a deeper understanding of basketball shooting and shown how data-driven approaches can complement basketball decision-making.
The methods and insights developed in this project provide a foundation for more sophisticated analysis and strategic innovation as the sport continues to evolve. By bridging the gap between advanced analytics and practical basketball strategy, we hope to contribute to the ongoing development of the game.